Part I - (Prosper Loan Data Exploration )

by (Hafsah Anibaba)

Introduction

This dataset contains 113937 loansbased on 81 variables. The data was downloaded from the udacity data archive through this link: https://www.google.com/url?q=https://s3.amazonaws.com/udacity-hosted-downloads/ud651/prosperLoanData.csv&sa=D&ust=1581581520570000

Preliminary Wrangling

In [1]:
# I want to import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
import seaborn as sb

%matplotlib inline

import warnings
warnings.simplefilter("ignore")
In [2]:
# I want to load dataset into variable data

data = pd.read_csv('data/prosperLoanData.csv')
In [3]:
# I want to set maximum number of columns to be displayed to number of columns in dataset
# To make all columns visible for visual assessment

pd.set_option('display.max_columns', data.shape[1])
In [4]:
#I want to display the  first five rows of the data

data.head(5)
Out[4]:
ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield EstimatedEffectiveYield EstimatedLoss EstimatedReturn ProsperRating (numeric) ProsperRating (Alpha) ProsperScore ListingCategory (numeric) BorrowerState Occupation EmploymentStatus EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup GroupKey DateCreditPulled CreditScoreRangeLower CreditScoreRangeUpper FirstRecordedCreditLine CurrentCreditLines OpenCreditLines TotalCreditLinespast7years OpenRevolvingAccounts OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries CurrentDelinquencies AmountDelinquent DelinquenciesLast7Years PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance BankcardUtilization AvailableBankcardCredit TotalTrades TradesNeverDelinquent (percentage) TradesOpenedLast6Months DebtToIncomeRatio IncomeRange IncomeVerifiable StatedMonthlyIncome LoanKey TotalProsperLoans TotalProsperPaymentsBilled OnTimeProsperPayments ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate ProsperPrincipalBorrowed ProsperPrincipalOutstanding ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter MemberKey MonthlyLoanPayment LP_CustomerPayments LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
0 1021339766868145413AB3B 193129 2007-08-26 19:09:29.263000000 C 36 Completed 2009-08-14 00:00:00 0.16516 0.1580 0.1380 NaN NaN NaN NaN NaN NaN 0 CO Other Self-employed 2.0 True True NaN 2007-08-26 18:41:46.780000000 640.0 659.0 2001-10-11 00:00:00 5.0 4.0 12.0 1 24.0 3.0 3.0 2.0 472.0 4.0 0.0 0.0 0.0 0.00 1500.0 11.0 0.81 0.0 0.17 $25,000-49,999 True 3083.333333 E33A3400205839220442E84 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 78 19141 9425 2007-09-12 00:00:00 Q3 2007 1F3E3376408759268057EDA 330.43 11396.14 9425.00 1971.14 -133.18 0.0 0.0 0.0 0.0 1.0 0 0 0.0 258
1 10273602499503308B223C1 1209647 2014-02-27 08:28:07.900000000 NaN 36 Current NaN 0.12016 0.0920 0.0820 0.07960 0.0249 0.05470 6.0 A 7.0 2 CO Professional Employed 44.0 False False NaN 2014-02-27 08:28:14 680.0 699.0 1996-03-18 00:00:00 14.0 14.0 29.0 13 389.0 3.0 5.0 0.0 0.0 0.0 1.0 0.0 3989.0 0.21 10266.0 29.0 1.00 2.0 0.18 $50,000-74,999 True 6125.000000 9E3B37071505919926B1D82 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 0 134815 10000 2014-03-03 00:00:00 Q1 2014 1D13370546739025387B2F4 318.93 0.00 0.00 0.00 0.00 0.0 0.0 0.0 0.0 1.0 0 0 0.0 1
2 0EE9337825851032864889A 81716 2007-01-05 15:00:47.090000000 HR 36 Completed 2009-12-17 00:00:00 0.28269 0.2750 0.2400 NaN NaN NaN NaN NaN NaN 0 GA Other Not available NaN False True 783C3371218786870A73D20 2007-01-02 14:09:10.060000000 480.0 499.0 2002-07-27 00:00:00 NaN NaN 3.0 0 0.0 0.0 1.0 1.0 NaN 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN 0.06 Not displayed True 2083.333333 6954337960046817851BCB2 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 86 6466 3001 2007-01-17 00:00:00 Q1 2007 5F7033715035555618FA612 123.32 4186.63 3001.00 1185.63 -24.20 0.0 0.0 0.0 0.0 1.0 0 0 0.0 41
3 0EF5356002482715299901A 658116 2012-10-22 11:02:35.010000000 NaN 36 Current NaN 0.12528 0.0974 0.0874 0.08490 0.0249 0.06000 6.0 A 9.0 16 GA Skilled Labor Employed 113.0 True False NaN 2012-10-22 11:02:32 800.0 819.0 1983-02-28 00:00:00 5.0 5.0 29.0 7 115.0 0.0 1.0 4.0 10056.0 14.0 0.0 0.0 1444.0 0.04 30754.0 26.0 0.76 0.0 0.15 $25,000-49,999 True 2875.000000 A0393664465886295619C51 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 16 77296 10000 2012-11-01 00:00:00 Q4 2012 9ADE356069835475068C6D2 321.45 5143.20 4091.09 1052.11 -108.01 0.0 0.0 0.0 0.0 1.0 0 0 0.0 158
4 0F023589499656230C5E3E2 909464 2013-09-14 18:38:39.097000000 NaN 36 Current NaN 0.24614 0.2085 0.1985 0.18316 0.0925 0.09066 3.0 D 4.0 2 MN Executive Employed 44.0 True False NaN 2013-09-14 18:38:44 680.0 699.0 2004-02-20 00:00:00 19.0 19.0 49.0 6 220.0 1.0 9.0 0.0 0.0 0.0 0.0 0.0 6193.0 0.81 695.0 39.0 0.95 2.0 0.26 $100,000+ True 9583.333333 A180369302188889200689E 1.0 11.0 11.0 0.0 0.0 11000.0 9947.9 NaN 0 NaN 6 102670 15000 2013-09-20 00:00:00 Q3 2013 36CE356043264555721F06C 563.97 2819.85 1563.22 1256.63 -60.27 0.0 0.0 0.0 0.0 1.0 0 0 0.0 20
In [5]:
# I want to Check the shape of the data

print('Shape:- \nThe number of loans recorded is', data.shape[0])
print('The number of fields in the dataset is', data.shape[1], end = '\n\nDimension:- \n')


# I want to check the dimension of the data
print(f'The dataset is in {data.ndim} dimension')
Shape:- 
The number of loans recorded is 113937
The number of fields in the dataset is 81

Dimension:- 
The dataset is in 2 dimension
In [6]:
# I want to Check columns in the dataset.

columns = data.columns.to_list()
columns 
Out[6]:
['ListingKey',
 'ListingNumber',
 'ListingCreationDate',
 'CreditGrade',
 'Term',
 'LoanStatus',
 'ClosedDate',
 'BorrowerAPR',
 'BorrowerRate',
 'LenderYield',
 'EstimatedEffectiveYield',
 'EstimatedLoss',
 'EstimatedReturn',
 'ProsperRating (numeric)',
 'ProsperRating (Alpha)',
 'ProsperScore',
 'ListingCategory (numeric)',
 'BorrowerState',
 'Occupation',
 'EmploymentStatus',
 'EmploymentStatusDuration',
 'IsBorrowerHomeowner',
 'CurrentlyInGroup',
 'GroupKey',
 'DateCreditPulled',
 'CreditScoreRangeLower',
 'CreditScoreRangeUpper',
 'FirstRecordedCreditLine',
 'CurrentCreditLines',
 'OpenCreditLines',
 'TotalCreditLinespast7years',
 'OpenRevolvingAccounts',
 'OpenRevolvingMonthlyPayment',
 'InquiriesLast6Months',
 'TotalInquiries',
 'CurrentDelinquencies',
 'AmountDelinquent',
 'DelinquenciesLast7Years',
 'PublicRecordsLast10Years',
 'PublicRecordsLast12Months',
 'RevolvingCreditBalance',
 'BankcardUtilization',
 'AvailableBankcardCredit',
 'TotalTrades',
 'TradesNeverDelinquent (percentage)',
 'TradesOpenedLast6Months',
 'DebtToIncomeRatio',
 'IncomeRange',
 'IncomeVerifiable',
 'StatedMonthlyIncome',
 'LoanKey',
 'TotalProsperLoans',
 'TotalProsperPaymentsBilled',
 'OnTimeProsperPayments',
 'ProsperPaymentsLessThanOneMonthLate',
 'ProsperPaymentsOneMonthPlusLate',
 'ProsperPrincipalBorrowed',
 'ProsperPrincipalOutstanding',
 'ScorexChangeAtTimeOfListing',
 'LoanCurrentDaysDelinquent',
 'LoanFirstDefaultedCycleNumber',
 'LoanMonthsSinceOrigination',
 'LoanNumber',
 'LoanOriginalAmount',
 'LoanOriginationDate',
 'LoanOriginationQuarter',
 'MemberKey',
 'MonthlyLoanPayment',
 'LP_CustomerPayments',
 'LP_CustomerPrincipalPayments',
 'LP_InterestandFees',
 'LP_ServiceFees',
 'LP_CollectionFees',
 'LP_GrossPrincipalLoss',
 'LP_NetPrincipalLoss',
 'LP_NonPrincipalRecoverypayments',
 'PercentFunded',
 'Recommendations',
 'InvestmentFromFriendsCount',
 'InvestmentFromFriendsAmount',
 'Investors']
In [7]:
# I want to See the datatypes of all columns

datatype = data.dtypes.to_frame()
datatype.columns = ['dtype']
datatype['dtype'] = datatype['dtype'].astype('category')
datatype['dtype'] = datatype['dtype'].apply(lambda x: str(x))
datatype.T
Out[7]:
ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield EstimatedEffectiveYield EstimatedLoss EstimatedReturn ProsperRating (numeric) ProsperRating (Alpha) ProsperScore ListingCategory (numeric) BorrowerState Occupation EmploymentStatus EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup GroupKey DateCreditPulled CreditScoreRangeLower CreditScoreRangeUpper FirstRecordedCreditLine CurrentCreditLines OpenCreditLines TotalCreditLinespast7years OpenRevolvingAccounts OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries CurrentDelinquencies AmountDelinquent DelinquenciesLast7Years PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance BankcardUtilization AvailableBankcardCredit TotalTrades TradesNeverDelinquent (percentage) TradesOpenedLast6Months DebtToIncomeRatio IncomeRange IncomeVerifiable StatedMonthlyIncome LoanKey TotalProsperLoans TotalProsperPaymentsBilled OnTimeProsperPayments ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate ProsperPrincipalBorrowed ProsperPrincipalOutstanding ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter MemberKey MonthlyLoanPayment LP_CustomerPayments LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
dtype object int64 object object int64 object object float64 float64 float64 float64 float64 float64 float64 object float64 int64 object object object float64 bool bool object object float64 float64 object float64 float64 float64 int64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 object bool float64 object float64 float64 float64 float64 float64 float64 float64 float64 int64 float64 int64 int64 int64 object object object float64 float64 float64 float64 float64 float64 float64 float64 float64 float64 int64 int64 float64 int64

Relative proportions of each datatype

In [8]:
# I want to Visualise the relative proportions of each datatype to other datatypes

# Calculate the relative proportions 
no_of_obsevation =  datatype['dtype'].shape[0]
max_proportion = datatype['dtype'].value_counts().to_list()[0]/no_of_obsevation
tick_proportions = np.arange(0, max_proportion, 0.03)
tick_names = ('{:0.2f}'.format(v) for v in tick_proportions)

#
plt.figure(figsize =(6,6))
order = datatype['dtype'].value_counts().index.to_list()
sb.countplot(x = "dtype" , data =datatype, order = order, color =sb.color_palette()[7],)
plt.xlabel('data types', fontdict = {'weight': 'bold'})
plt.ylabel('count of data types', fontdict = {'weight': 'bold'})
plt.yticks(tick_proportions * no_of_obsevation, tick_names);

#Calculate and display the percentage of each datatype
for i in range(datatype['dtype'].value_counts().shape[0]):
    count = datatype['dtype'].value_counts().to_list()[i]
    pct_string = '{:0.1f}%'.format(100*(count/no_of_obsevation))
    plt.text(i,count+1, pct_string, ha = 'center');
In [9]:
#I want to extract columns of object datatype

object_columns = datatype[datatype['dtype'] == 'object'].index.to_list()
object_columns
Out[9]:
['ListingKey',
 'ListingCreationDate',
 'CreditGrade',
 'LoanStatus',
 'ClosedDate',
 'ProsperRating (Alpha)',
 'BorrowerState',
 'Occupation',
 'EmploymentStatus',
 'GroupKey',
 'DateCreditPulled',
 'FirstRecordedCreditLine',
 'IncomeRange',
 'LoanKey',
 'LoanOriginationDate',
 'LoanOriginationQuarter',
 'MemberKey']

Data Cleaning

In [10]:
# make a copy of original data to clean
data_copy = data.copy()
In [11]:
# From the data description CreditGrade is for loans before 2009 and ProsperRating (Alpha) is 
#for loans from 2009

data_copy['ProsperRating (Alpha)'] = data_copy[['CreditGrade', 'ProsperRating (Alpha)']]\
['ProsperRating (Alpha)'].fillna(data_copy['CreditGrade'])
data_copy['CreditGrade'] = data_copy[['CreditGrade', 'ProsperRating (Alpha)']]['CreditGrade']\
.fillna(data_copy['ProsperRating (Alpha)'])

#test
data_copy[['CreditGrade', 'ProsperRating (Alpha)']].head(5)
Out[11]:
CreditGrade ProsperRating (Alpha)
0 C C
1 A A
2 HR HR
3 A A
4 D D

Redefining categorical data type into ordinal and nominal

In [12]:
#I want to extract the quarter from the LoanOriginationQuarter column

data_copy['LoanOriginationQuarter'] = data_copy['LoanOriginationQuarter'].apply(lambda x: x.split(' ')[0])

# test
data_copy['LoanOriginationQuarter'][1]
Out[12]:
'Q1'
In [13]:
#I want to define the ordinal and nominal data

# ordinal categorical
ordinal_categorical = {'ProsperRating (Alpha)':['NC','HR', 'E', 'D', 'C', 'B', 'A', 'AA'],
                       'CreditGrade': ['NC', 'HR', 'E', 'D', 'C', 'B', 'A', 'AA'],
                       'IncomeRange': ['$0', '$1 - 24,999', '$25,000 - 49,999', '$50,000 - 74,999',\
                                     '$75,000 - 99,999', '$100,000+', 'Not employed', 'Not displayed'],
                       'LoanOriginationQuarter': ['Q1', 'Q2', 'Q3', 'Q4']
                      }


#nominal categorical
nominal_categorical = ['EmploymentStatus','LoanStatus', 'BorrowerState', 'Occupation']
In [14]:
for var in ordinal_categorical:
    ordered_var = pd.api.types.CategoricalDtype(ordered = True,
                                                categories = ordinal_categorical[var])
    data_copy[var] = data_copy[var].astype(ordered_var)
        
for var in nominal_categorical:
    data_copy[var] = data_copy[var].astype('category')
    

# test
print(data_copy[ordinal_categorical].dtypes, end = '\n\nnominal data \n')
print(data_copy[nominal_categorical].dtypes)
ProsperRating (Alpha)     category
CreditGrade               category
IncomeRange               category
LoanOriginationQuarter    category
dtype: object

nominal data 
EmploymentStatus    category
LoanStatus          category
BorrowerState       category
Occupation          category
dtype: object
In [15]:
# remove column I won't need in the analysis

data_copy =  data_copy[['LoanStatus', 'LenderYield', 'ProsperRating (Alpha)', 'ListingCategory (numeric)', \
                        'BorrowerState', 'Occupation', 'EmploymentStatus', 'IsBorrowerHomeowner', \
                        'CurrentlyInGroup', 'CreditScoreRangeLower', 'CreditScoreRangeUpper', \
                        'TotalCreditLinespast7years', 'DebtToIncomeRatio', 'IncomeRange', \
                        'IncomeVerifiable', 'LoanOriginationQuarter', 'MonthlyLoanPayment', \
                        'Recommendations', 'InvestmentFromFriendsCount', 'InvestmentFromFriendsAmount',\
                        'Investors', 'PercentFunded']]

# test
print(data_copy.columns.to_list())
data_copy.head()
['LoanStatus', 'LenderYield', 'ProsperRating (Alpha)', 'ListingCategory (numeric)', 'BorrowerState', 'Occupation', 'EmploymentStatus', 'IsBorrowerHomeowner', 'CurrentlyInGroup', 'CreditScoreRangeLower', 'CreditScoreRangeUpper', 'TotalCreditLinespast7years', 'DebtToIncomeRatio', 'IncomeRange', 'IncomeVerifiable', 'LoanOriginationQuarter', 'MonthlyLoanPayment', 'Recommendations', 'InvestmentFromFriendsCount', 'InvestmentFromFriendsAmount', 'Investors', 'PercentFunded']
Out[15]:
LoanStatus LenderYield ProsperRating (Alpha) ListingCategory (numeric) BorrowerState Occupation EmploymentStatus IsBorrowerHomeowner CurrentlyInGroup CreditScoreRangeLower CreditScoreRangeUpper TotalCreditLinespast7years DebtToIncomeRatio IncomeRange IncomeVerifiable LoanOriginationQuarter MonthlyLoanPayment Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors PercentFunded
0 Completed 0.1380 C 0 CO Other Self-employed True True 640.0 659.0 12.0 0.17 NaN True Q3 330.43 0 0 0.0 258 1.0
1 Current 0.0820 A 2 CO Professional Employed False False 680.0 699.0 29.0 0.18 NaN True Q1 318.93 0 0 0.0 1 1.0
2 Completed 0.2400 HR 0 GA Other Not available False True 480.0 499.0 3.0 0.06 Not displayed True Q1 123.32 0 0 0.0 41 1.0
3 Current 0.0874 A 16 GA Skilled Labor Employed True False 800.0 819.0 29.0 0.15 NaN True Q4 321.45 0 0 0.0 158 1.0
4 Current 0.1985 D 2 MN Executive Employed True False 680.0 699.0 49.0 0.26 $100,000+ True Q3 563.97 0 0 0.0 20 1.0
In [16]:
#I want to for duplicate  observations

data.duplicated().sum()
Out[16]:
0
In [17]:
#I want to check for null values

data_copy.isnull().sum()
Out[17]:
LoanStatus                         0
LenderYield                        0
ProsperRating (Alpha)            131
ListingCategory (numeric)          0
BorrowerState                   5515
Occupation                      3588
EmploymentStatus                2255
IsBorrowerHomeowner                0
CurrentlyInGroup                   0
CreditScoreRangeLower            591
CreditScoreRangeUpper            591
TotalCreditLinespast7years       697
DebtToIncomeRatio               8554
IncomeRange                    87432
IncomeVerifiable                   0
LoanOriginationQuarter             0
MonthlyLoanPayment                 0
Recommendations                    0
InvestmentFromFriendsCount         0
InvestmentFromFriendsAmount        0
Investors                          0
PercentFunded                      0
dtype: int64
In [18]:
#I want to filter out names of columns with missing data

columns_with_missing_data = data_copy.columns[data_copy.isnull().sum() >0]
columns_with_missing_data
Out[18]:
Index(['ProsperRating (Alpha)', 'BorrowerState', 'Occupation',
       'EmploymentStatus', 'CreditScoreRangeLower', 'CreditScoreRangeUpper',
       'TotalCreditLinespast7years', 'DebtToIncomeRatio', 'IncomeRange'],
      dtype='object')

Visualizing missing data

In [19]:
#Visualize how much data is missing in each column

counts_of_missing_value = data_copy[columns_with_missing_data].isnull().sum()
percentage = counts_of_missing_value/data_copy.shape[0] * 100
sb.barplot(y = counts_of_missing_value.index, x =counts_of_missing_value, color = 'blue')

for i in range(percentage.shape[0]):
    plt.text(x = counts_of_missing_value[i],  y = i, s = '{:0.2f}%'.format(percentage[i]))

plt.xlabel('percentage of missing values', fontdict = {'weight': 'bold'})
plt.ylabel('missingvalues', fontdict = {'weight': 'bold'});
In [20]:
#In order not to lose many data , I want to check observattions that had missing values in 
#more than one feature

def both_missing(column1, column2):
    return len(data_copy[data_copy[column1].isna() & data_copy[column2].isna()])




print(both_missing('ProsperRating (Alpha)', 'DebtToIncomeRatio'))
print(both_missing('DebtToIncomeRatio', 'BorrowerState'))
print(both_missing('ProsperRating (Alpha)', 'BorrowerState'))
11
72
0
In [21]:
data_copy[columns_with_missing_data.to_list()].isna().sum()
Out[21]:
ProsperRating (Alpha)           131
BorrowerState                  5515
Occupation                     3588
EmploymentStatus               2255
CreditScoreRangeLower           591
CreditScoreRangeUpper           591
TotalCreditLinespast7years      697
DebtToIncomeRatio              8554
IncomeRange                   87432
dtype: int64
In [22]:
print(both_missing('ProsperRating (Alpha)', 'DebtToIncomeRatio'))
print(both_missing('EmploymentStatus', 'Occupation'))
11
2255
In [23]:
# since 'Occupation' is missing for every missing 'EmploymentStatus' values, I will drop all
#missing values in 'EmploymentStatus' 

data_copy = data_copy.dropna(subset = ['EmploymentStatus'])

#test
data_copy['EmploymentStatus'].isna().sum()
Out[23]:
0
In [24]:
data_copy[columns_with_missing_data.to_list()].isna().sum()
Out[24]:
ProsperRating (Alpha)           131
BorrowerState                  4128
Occupation                     1333
EmploymentStatus                  0
CreditScoreRangeLower             2
CreditScoreRangeUpper             2
TotalCreditLinespast7years       88
DebtToIncomeRatio              8525
IncomeRange                   87432
dtype: int64
In [25]:
print(both_missing('ProsperRating (Alpha)', 'TotalCreditLinespast7years'))
print(both_missing('TotalCreditLinespast7years', 'DebtToIncomeRatio'))
print(both_missing('ProsperRating (Alpha)', 'DebtToIncomeRatio'))
0
86
11
In [26]:
# I will drop all null values in TotalCreditLinespast7years, to also get rid of null values in 
#DebtToIncomeRatio

data_copy = data_copy.dropna(subset = ['TotalCreditLinespast7years'])

#test
data_copy['TotalCreditLinespast7years'].isna().sum()
Out[26]:
0
In [27]:
print(both_missing('ProsperRating (Alpha)', 'DebtToIncomeRatio'))
11
In [28]:
# I will drop all null values in ProsperRating (Alpha), to also get rid of null values in 
#DebtToIncomeRatio

data_copy = data_copy.dropna(subset = ['ProsperRating (Alpha)'])

#test
data_copy['ProsperRating (Alpha)'].isna().sum()
Out[28]:
0
In [29]:
data_copy[columns_with_missing_data.to_list()].isna().sum()
Out[29]:
ProsperRating (Alpha)             0
BorrowerState                  4102
Occupation                     1333
EmploymentStatus                  0
CreditScoreRangeLower             0
CreditScoreRangeUpper             0
TotalCreditLinespast7years        0
DebtToIncomeRatio              8428
IncomeRange                   87303
dtype: int64
In [30]:
# check the percentage of data remaining after dropping some observations

print('percentage of data left', (data_copy.shape[0]/data.shape[0]) * 100)
percentage of data left 97.82862459078262
In [31]:
#check how much percentage of data is still missing for each variable

def percentage_missing(column):
    value = data_copy[column].isna().sum() / data.shape[0] * 100
    print('percentage of data still missing in', column, 'is', '{:0.2f}%'.format(value))
    
percentage_missing('BorrowerState')
percentage_missing('Occupation')
percentage_missing('DebtToIncomeRatio')
percentage of data still missing in BorrowerState is 3.60%
percentage of data still missing in Occupation is 1.17%
percentage of data still missing in DebtToIncomeRatio is 7.40%

I want to check if the variables with missing data have missing data together in one observation

In [32]:
def index_generator(column_name):
    return data_copy[column_name][data_copy[column_name].isna()].index.to_list()
In [33]:
#retrieve the index of observation with null values in the 'ProsperRating (Alpha)' variable

null_state =index_generator('BorrowerState')
In [34]:
##retrieve the index of observation with null values in the 'Occupation' variable

null_Occupation = index_generator('Occupation')
In [35]:
###retrieve the index of observation with null values in the 'DebtToIncomeRatio' variable

null_income = index_generator('DebtToIncomeRatio')
In [36]:
#compare the lists
def compare_lists(list1, list2):
    index1 = [i for i in list1 if i in list2]
    index2 = [i for i in list2 if i in list1]
    return len(index1), len(index2)
In [37]:
#compare the null_Occupation and null_ProsperRating to see if they have similar indices

compare_lists(null_state, null_Occupation)
Out[37]:
(0, 0)
In [38]:
#compare the null_Occupation and null_income to see if they have similar indices

compare_lists(null_income, null_Occupation)
Out[38]:
(0, 0)
In [39]:
#compare the null_ProsperRating and null_income to see if they have similar indices

compare_lists(null_income, null_state)
Out[39]:
(32, 32)
In [40]:
#check the observations with the similar indices in null_ProsperRating and null_income
index = [i for i in null_income if i in null_state]
data.loc[index]
Out[40]:
ListingKey ListingNumber ListingCreationDate CreditGrade Term LoanStatus ClosedDate BorrowerAPR BorrowerRate LenderYield EstimatedEffectiveYield EstimatedLoss EstimatedReturn ProsperRating (numeric) ProsperRating (Alpha) ProsperScore ListingCategory (numeric) BorrowerState Occupation EmploymentStatus EmploymentStatusDuration IsBorrowerHomeowner CurrentlyInGroup GroupKey DateCreditPulled CreditScoreRangeLower CreditScoreRangeUpper FirstRecordedCreditLine CurrentCreditLines OpenCreditLines TotalCreditLinespast7years OpenRevolvingAccounts OpenRevolvingMonthlyPayment InquiriesLast6Months TotalInquiries CurrentDelinquencies AmountDelinquent DelinquenciesLast7Years PublicRecordsLast10Years PublicRecordsLast12Months RevolvingCreditBalance BankcardUtilization AvailableBankcardCredit TotalTrades TradesNeverDelinquent (percentage) TradesOpenedLast6Months DebtToIncomeRatio IncomeRange IncomeVerifiable StatedMonthlyIncome LoanKey TotalProsperLoans TotalProsperPaymentsBilled OnTimeProsperPayments ProsperPaymentsLessThanOneMonthLate ProsperPaymentsOneMonthPlusLate ProsperPrincipalBorrowed ProsperPrincipalOutstanding ScorexChangeAtTimeOfListing LoanCurrentDaysDelinquent LoanFirstDefaultedCycleNumber LoanMonthsSinceOrigination LoanNumber LoanOriginalAmount LoanOriginationDate LoanOriginationQuarter MemberKey MonthlyLoanPayment LP_CustomerPayments LP_CustomerPrincipalPayments LP_InterestandFees LP_ServiceFees LP_CollectionFees LP_GrossPrincipalLoss LP_NetPrincipalLoss LP_NonPrincipalRecoverypayments PercentFunded Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors
917 7FC53366128277829738FCD 30111 2006-08-09 06:46:29.320000000 A 36 Completed 2009-08-15 00:00:00 0.15713 0.1500 0.1425 NaN NaN NaN NaN NaN NaN 0 NaN Other Not available NaN True True EF543365873906104DC9A04 2006-08-08 12:02:36.813000000 720.0 739.0 1984-05-11 00:00:00 NaN NaN 17.0 3 296.0 6.0 14.0 0.0 NaN 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN Not displayed False 0.000000 876B3366000239247C55054 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 91 2552 8000 2006-08-15 00:00:00 Q3 2006 4D1333648643890119C04A5 274.24 9953.49 8000.00 1953.49 -66.22 0.00 0.00 0.00 0.0 1.0 0 0 0.00 63
948 0A963376701968804B62AB3 74226 2006-12-13 13:47:01.053000000 C 36 Completed 2009-07-13 00:00:00 0.14770 0.1299 0.1249 NaN NaN NaN NaN NaN NaN 0 NaN Homemaker Not available NaN True False NaN 2006-12-12 08:47:21.410000000 660.0 679.0 2003-05-31 00:00:00 NaN NaN 17.0 9 184.0 7.0 13.0 0.0 NaN 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN Not displayed False 0.000000 078433761616917711A6C73 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 87 5636 1000 2006-12-19 00:00:00 Q4 2006 C4553375761266512DFC98A 33.69 1206.41 1000.00 206.41 -7.94 0.00 0.00 0.00 0.0 1.0 0 0 0.00 37
5054 94CE33685565947669D406A 37916 2006-09-08 18:39:55.137000000 D 36 Defaulted 2007-10-14 00:00:00 0.24753 0.2400 0.2350 NaN NaN NaN NaN NaN NaN 0 NaN Homemaker Not available NaN False True 6A3B336601725506917317E 2006-08-27 14:29:37.600000000 620.0 639.0 1994-11-21 00:00:00 NaN NaN 13.0 3 151.0 1.0 2.0 0.0 NaN 1.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN Not displayed False 0.000000 02513368465834817A8009A NaN NaN NaN NaN NaN NaN NaN NaN 193 13.0 90 3262 11000 2006-09-15 00:00:00 Q3 2006 4B573366565771652374F94 431.56 3442.48 1794.32 1648.16 -34.02 0.00 9212.55 9205.68 0.0 1.0 0 0 0.00 19
13964 06F33393589784867BF32D9 159317 2007-06-27 12:57:21.037000000 AA 36 Completed 2008-01-16 00:00:00 0.10080 0.0939 0.0864 NaN NaN NaN NaN NaN NaN 0 NaN Computer Programmer Full-time 1.0 False True FDB233849559610962FC479 2007-06-09 19:27:02.840000000 760.0 779.0 2002-01-02 00:00:00 6.0 6.0 6.0 5 10.0 0.0 4.0 0.0 0.0 0.0 0.0 0.0 1118.0 0.09 10209.0 6.0 1.00 0.0 NaN $0 False 0.000000 39763393819595641C6F0D0 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 80 16724 4000 2007-07-06 00:00:00 Q3 2007 68ED33711626790328D1B4D 127.93 4135.64 4000.00 135.64 -7.41 0.00 0.00 0.00 0.0 1.0 1 0 0.00 58
15719 19DB336840890878311FC84 41907 2006-09-24 20:46:00.410000000 E 36 Completed 2006-11-20 00:00:00 0.25261 0.2400 0.2350 NaN NaN NaN NaN NaN NaN 0 NaN Homemaker Not available NaN True True 6A3B336601725506917317E 2006-09-24 20:37:36.697000000 540.0 559.0 1999-12-21 00:00:00 NaN NaN 17.0 1 20.0 0.0 4.0 0.0 NaN 5.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN Not displayed False 0.000000 BFF03369288422951BDED73 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 90 3549 1500 2006-09-28 00:00:00 Q3 2006 3EF133647645155044BFFD9 58.85 1551.83 1500.00 51.83 -1.08 0.00 0.00 0.00 0.0 1.0 0 0 0.00 11
17763 99FE3403724684817F1604B 224416 2007-11-01 13:20:23.103000000 AA 36 Completed 2008-08-07 00:00:00 0.11194 0.1050 0.1050 NaN NaN NaN NaN NaN NaN 0 NaN Realtor Self-employed 88.0 True False NaN 2007-10-07 08:52:40.200000000 760.0 779.0 1988-06-07 00:00:00 14.0 9.0 39.0 10 1068.0 5.0 18.0 0.0 0.0 0.0 0.0 0.0 257784.0 0.60 169816.0 36.0 1.00 3.0 NaN $0 False 0.000000 40BF340359945665641DB5E NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 76 21485 2800 2007-11-08 00:00:00 Q4 2007 CE4733869220449489EC671 91.01 2854.27 2800.00 54.27 0.00 0.00 0.00 0.00 0.0 1.0 1 0 0.00 23
21404 49563395624224250B02409 180236 2007-08-01 20:04:15.273000000 C 36 Completed 2010-06-04 00:00:00 0.13453 0.1275 0.1175 NaN NaN NaN NaN NaN NaN 0 NaN Other Not employed 25.0 False True FB7E337608795605823EDCF 2007-08-01 19:56:14.670000000 640.0 659.0 1998-08-21 00:00:00 4.0 4.0 26.0 4 76.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 17531.0 0.88 2269.0 12.0 1.00 0.0 NaN $0 False 0.000000 8C3A3397631428144F0C268 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 79 18006 3000 2007-08-15 00:00:00 Q3 2007 5709339069157083673EE2F 100.72 3628.17 3000.01 628.16 -49.26 0.00 0.00 0.00 0.0 1.0 0 0 0.00 56
21935 666D337504009381715A4FD 68089 2006-11-29 12:08:13.453000000 E 36 Completed 2006-12-28 00:00:00 0.24753 0.2400 0.2350 NaN NaN NaN NaN NaN NaN 0 NaN Homemaker Not available NaN True True 6A3B336601725506917317E 2006-11-28 14:58:52.150000000 540.0 559.0 1999-02-17 00:00:00 NaN NaN 19.0 1 20.0 1.0 5.0 0.0 NaN 5.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN Not displayed False 0.000000 2F043374637793948D619F3 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 87 5228 3250 2006-12-06 00:00:00 Q4 2006 3EF133647645155044BFFD9 127.51 3297.01 3250.00 47.01 -0.98 0.00 0.00 0.00 0.0 1.0 0 0 0.00 37
24845 5D4033769927689389FBCF1 77943 2006-12-26 02:41:50.547000000 D 36 Completed 2008-08-12 00:00:00 0.15462 0.1475 0.1425 NaN NaN NaN NaN NaN NaN 0 NaN Other Not available NaN False False NaN 2006-12-25 23:36:56.097000000 600.0 619.0 2002-02-11 00:00:00 NaN NaN 9.0 7 753.0 0.0 3.0 0.0 NaN 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN Not displayed False 0.000000 B5253378016607860684EE2 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 86 6156 3500 2007-01-08 00:00:00 Q1 2007 AD3C3376471933313797629 120.90 4144.13 3500.00 644.13 -21.82 0.00 0.00 0.00 0.0 1.0 0 0 0.00 94
27749 B4713370105500182A62F5B 41723 2006-09-23 19:22:28.417000000 A 36 Completed 2007-02-22 00:00:00 0.15713 0.1500 0.1450 NaN NaN NaN NaN NaN NaN 0 NaN Student - College Junior Not available NaN False True 6A9C3365296327999C99787 2006-09-23 19:12:06.550000000 740.0 759.0 2001-05-30 00:00:00 NaN NaN 11.0 5 10.0 0.0 1.0 0.0 NaN 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN Not displayed False 0.000000 6C5B3369151726192E59AB9 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 89 3672 6001 2006-10-03 00:00:00 Q4 2006 BB183368333985110E03E3D 208.03 6289.04 6001.00 288.04 -9.60 0.00 0.00 0.00 0.0 1.0 0 0 0.00 64
38225 103533800679133562C4402 92861 2007-02-03 12:40:43.577000000 A 36 Completed 2009-11-13 00:00:00 0.10739 0.0900 0.0850 NaN NaN NaN NaN NaN NaN 0 NaN Student - College Junior Not available NaN False False NaN 2007-01-19 12:15:02.527000000 720.0 739.0 2004-10-05 00:00:00 NaN NaN 4.0 2 20.0 0.0 1.0 0.0 NaN 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN Not displayed False 0.000000 12233381845927925D6CEDD NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 85 7389 1000 2007-02-13 00:00:00 Q1 2007 9B4D33660618611019BA145 31.80 1168.07 1027.55 140.52 -6.54 -19.58 0.00 0.00 0.0 1.0 0 0 0.00 40
42395 93B4339141342851666956C 152073 2007-06-14 10:37:51.910000000 C 36 Completed 2009-08-14 00:00:00 0.18907 0.1818 0.1718 NaN NaN NaN NaN NaN NaN 0 NaN Teacher Part-time 5.0 False False NaN 2007-06-14 10:14:48.907000000 640.0 659.0 1998-07-02 00:00:00 14.0 10.0 27.0 7 151.0 1.0 6.0 0.0 0.0 0.0 0.0 0.0 11166.0 0.57 8384.0 27.0 0.96 1.0 NaN $0 False 0.000000 E2AB3393423421227FEFFA7 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 81 16260 4000 2007-06-21 00:00:00 Q2 2007 4E743391954562419C38EF1 144.97 5103.77 4000.00 1103.77 -60.71 0.00 0.00 0.00 0.0 1.0 1 0 0.00 76
44905 8C903386539459536ABD2F9 123824 2007-04-15 11:49:55.730000000 HR 36 Completed 2008-03-06 00:00:00 0.11898 0.1050 0.0950 NaN NaN NaN NaN NaN NaN 0 NaN Student - College Senior Full-time 44.0 False False NaN 2007-04-06 14:48:04.680000000 540.0 559.0 1996-09-08 00:00:00 3.0 2.0 25.0 2 84.0 2.0 4.0 1.0 1132.0 0.0 0.0 0.0 2026.0 1.01 0.0 3.0 0.50 2.0 NaN $0 False 0.000000 31FF3387378976785B0499B NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 83 13099 1975 2007-04-20 00:00:00 Q2 2007 C79A338470769102830EE85 64.19 2136.22 1975.00 161.22 -15.37 0.00 0.00 0.00 0.0 1.0 0 0 0.00 1
51075 E7D83395106515050AC9B74 170900 2007-07-17 08:16:19.527000000 B 36 Completed 2010-07-31 00:00:00 0.29625 0.2885 0.2585 NaN NaN NaN NaN NaN NaN 0 NaN Other Full-time 363.0 True True 9BBE337094173775621CD34 2007-06-25 07:28:38.523000000 680.0 699.0 1983-08-16 00:00:00 42.0 33.0 59.0 32 4005.0 2.0 5.0 0.0 0.0 0.0 0.0 0.0 117920.0 0.94 6202.0 58.0 0.98 0.0 NaN $0 False 0.000000 E88C33951125589574A136B NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 80 17452 10000 2007-07-31 00:00:00 Q3 2007 F13F3393693437180BDDFCD 418.24 14721.75 10000.00 4721.75 -175.86 0.00 0.00 0.00 0.0 1.0 0 0 0.00 106
56280 8CE33394075637813E092F3 173354 2007-07-20 18:50:30.787000000 C 36 Chargedoff 2008-04-02 00:00:00 0.29274 0.2850 0.2750 NaN NaN NaN NaN NaN NaN 0 NaN Sales - Retail Self-employed 198.0 True False NaN 2007-07-01 20:37:02.430000000 640.0 659.0 1990-04-04 00:00:00 20.0 19.0 42.0 16 2018.0 2.0 30.0 0.0 0.0 0.0 0.0 0.0 66008.0 0.95 2819.0 39.0 0.97 0.0 NaN $0 False 0.000000 1AB23395205101109B0A95F NaN NaN NaN NaN NaN NaN NaN NaN 2289 8.0 79 17678 12000 2007-08-03 00:00:00 Q3 2007 414B33944181094090623DC 499.61 1498.83 652.17 846.66 -29.70 0.00 11347.84 11347.83 0.0 1.0 0 0 0.00 120
56311 DEA5337179891180924526A 55588 2006-10-31 16:14:05.487000000 C 36 Completed 2007-02-09 00:00:00 0.12700 0.1200 0.1150 NaN NaN NaN NaN NaN NaN 0 NaN Student - College Graduate Student Not available NaN False True E3F433654060844916DB34F 2006-10-31 12:12:23.563000000 660.0 679.0 2004-10-28 00:00:00 NaN NaN 15.0 2 16.0 3.0 11.0 0.0 NaN 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN Not displayed False 0.083333 EF2633730227404806039FE NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 88 4730 5000 2006-11-17 00:00:00 Q4 2006 8FA3337099178353715104A 166.07 5091.20 5000.00 91.20 -3.80 0.00 0.00 0.00 0.0 1.0 0 0 0.00 116
57556 B3133364585382753630D0C 25751 2006-07-20 16:00:23.170000000 D 36 Chargedoff 2007-12-26 00:00:00 0.28771 0.2800 0.2750 NaN NaN NaN NaN NaN NaN 0 NaN Tradesman - Electrician Not available NaN False True 3D4D3366260257624AB272D 2006-07-06 20:24:50.503000000 600.0 619.0 2003-05-04 00:00:00 NaN NaN 11.0 7 355.0 8.0 11.0 0.0 NaN 0.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN Not displayed False 0.000000 42E333648150346424F7315 NaN NaN NaN NaN NaN NaN NaN NaN 2387 17.0 92 2155 12000 2006-07-27 00:00:00 Q3 2006 AE103365589750373D5D9F1 496.36 5949.96 2905.73 3044.23 -54.37 0.00 9094.25 9094.27 0.0 1.0 0 0 0.00 44
66262 6F363404188796708A9E6C1 223891 2007-10-31 12:09:33.847000000 AA 36 Completed 2007-11-09 00:00:00 0.13202 0.1250 0.1250 NaN NaN NaN NaN NaN NaN 0 NaN Other Full-time 55.0 True False NaN 2007-10-31 11:42:09.327000000 760.0 779.0 1995-03-22 00:00:00 12.0 10.0 26.0 9 1510.0 3.0 5.0 0.0 0.0 0.0 0.0 0.0 265702.0 0.85 45148.0 26.0 1.00 0.0 NaN $0 False 0.000000 484C340557954079480BD32 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 76 21497 20000 2007-11-08 00:00:00 Q4 2007 6F09336433415858375CF9F 669.07 20006.85 20000.00 6.85 0.00 0.00 0.00 0.00 0.0 1.0 0 0 0.00 417
68284 C15F3398836613855BCCA50 197124 2007-09-04 12:13:37.447000000 C 36 Completed 2010-08-12 00:00:00 0.20138 0.1830 0.1630 NaN NaN NaN NaN NaN NaN 0 NaN Other Self-employed 12.0 True True 223133764057631067F3FD8 2007-09-04 11:48:58.570000000 640.0 659.0 1995-04-25 00:00:00 8.0 7.0 16.0 5 72.0 20.0 38.0 0.0 0.0 15.0 0.0 0.0 1241.0 0.40 1397.0 13.0 0.57 0.0 NaN $0 False 0.000000 08F534005406541768A8A04 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 78 19110 1000 2007-09-12 00:00:00 Q3 2007 05A33385054509573CEA393 36.30 1274.44 999.52 274.92 -15.89 0.00 0.00 0.00 0.0 1.0 0 0 0.00 9
70356 829333841556055030BE3C7 107979 2007-03-07 12:45:19.150000000 D 36 Chargedoff 2007-12-28 00:00:00 0.18224 0.1750 0.1650 NaN NaN NaN NaN NaN NaN 0 NaN Other Full-time 7.0 False True 783C3371218786870A73D20 2007-03-03 06:32:19.297000000 620.0 639.0 2006-05-13 00:00:00 11.0 10.0 11.0 10 20.0 3.0 9.0 0.0 0.0 0.0 0.0 0.0 222.0 0.08 2286.0 11.0 1.00 7.0 NaN $0 False 0.083333 56B93386301660057C8F161 NaN NaN NaN NaN NaN NaN NaN NaN 2385 9.0 84 9472 4250 2007-03-29 00:00:00 Q1 2007 A79A33648508891688650E6 152.58 610.32 571.43 38.89 -2.22 0.00 3678.57 3678.57 0.0 1.0 0 0 0.00 165
76338 19933382706732403886B11 105697 2007-03-02 16:47:36.507000000 A 36 Completed 2007-10-03 00:00:00 0.13705 0.1300 0.1250 NaN NaN NaN NaN NaN NaN 0 NaN Professional Full-time 11.0 False False NaN 2007-03-01 11:11:26.437000000 740.0 759.0 1975-06-01 00:00:00 8.0 7.0 42.0 8 60.0 0.0 14.0 0.0 0.0 0.0 0.0 0.0 1349.0 0.09 13101.0 29.0 0.75 4.0 NaN $0 False 0.000000 BED433852693697269BA986 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 84 8871 16200 2007-03-23 00:00:00 Q1 2007 D43F337970435057119C61D 545.84 17195.05 16200.00 995.05 -38.27 0.00 0.00 0.00 0.0 1.0 0 0 0.00 408
80597 1DE03392951950447A91445 162907 2007-07-04 08:12:22.003000000 D 36 Chargedoff 2009-10-21 00:00:00 0.19228 0.1850 0.1750 NaN NaN NaN NaN NaN NaN 0 NaN Other Not employed 2.0 False True FB7E337608795605823EDCF 2007-06-06 16:58:37.267000000 600.0 619.0 2004-04-07 00:00:00 8.0 8.0 10.0 7 213.0 4.0 5.0 0.0 0.0 0.0 0.0 0.0 7782.0 0.76 1584.0 9.0 1.00 4.0 NaN $0 False 0.000000 9B413395628927631C74605 NaN NaN NaN NaN NaN NaN NaN NaN 1724 27.0 80 17101 7100 2007-07-20 00:00:00 Q3 2007 568C3391154789199B6D502 258.47 5666.34 3814.29 1852.05 -100.11 0.00 3285.71 3285.71 0.0 1.0 0 0 0.00 88
81885 E38D33986493625004FDEB9 198185 2007-09-06 08:54:19.220000000 A 36 Completed 2008-07-16 00:00:00 0.10867 0.1000 0.0950 NaN NaN NaN NaN NaN NaN 0 NaN Other Not employed 41.0 False False NaN 2007-09-03 11:15:51.967000000 740.0 759.0 2005-08-03 00:00:00 2.0 2.0 2.0 2 34.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 283.0 0.03 8717.0 2.0 1.00 0.0 NaN Not employed False 0.000000 8964339941860481288A714 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 78 19410 2000 2007-09-20 00:00:00 Q3 2007 9D8F33958073759508B6030 64.53 2146.54 2000.00 146.54 -7.32 0.00 0.00 0.00 0.0 1.0 1 1 71.31 46
83675 CB3F3382864519577D0E8E0 101870 2007-02-20 16:03:29.673000000 AA 36 Completed 2007-05-30 00:00:00 0.09688 0.0900 0.0825 NaN NaN NaN NaN NaN NaN 0 NaN Homemaker Not employed 359.0 False True B7FA3373291502314D48D19 2007-02-16 20:14:11.470000000 800.0 819.0 1975-11-01 00:00:00 6.0 6.0 12.0 9 70.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1474.0 0.05 25426.0 12.0 1.00 1.0 NaN Not employed False 0.000000 AC4333850070866652A694E NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 84 8581 4000 2007-03-16 00:00:00 Q1 2007 51053374850843551DC6120 127.20 4062.07 4000.00 62.07 -3.55 0.00 0.00 0.00 0.0 1.0 3 2 700.00 50
84893 B19C339311699177459CBB2 159719 2007-06-28 07:54:13.017000000 A 36 Completed 2008-12-11 00:00:00 0.10692 0.1000 0.0950 NaN NaN NaN NaN NaN NaN 0 NaN Attorney Full-time 0.0 False True FB7E337608795605823EDCF 2007-06-27 19:14:09.720000000 720.0 739.0 2000-11-30 00:00:00 13.0 8.0 35.0 7 122.0 2.0 5.0 0.0 0.0 0.0 0.0 0.0 4856.0 0.25 11319.0 18.0 0.94 4.0 NaN $0 False 0.000000 7BCF339483569111439E256 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 80 16741 5000 2007-07-06 00:00:00 Q3 2007 421A33930603173928E50D4 161.34 5573.31 5000.00 573.31 -28.67 0.00 0.00 0.00 0.0 1.0 0 0 0.00 142
95185 AE443402936913216332C60 220685 2007-10-24 09:25:07.690000000 B 36 Defaulted 2008-10-05 00:00:00 0.15713 0.1500 0.1400 NaN NaN NaN NaN NaN NaN 0 NaN Realtor Full-time 45.0 True False NaN 2007-10-10 00:16:26.080000000 680.0 699.0 1987-04-03 00:00:00 19.0 19.0 44.0 18 1847.0 1.0 7.0 0.0 0.0 0.0 0.0 0.0 63682.0 0.95 2277.0 36.0 1.00 1.0 NaN $0 False 0.000000 D1BB3403131897991E9DA0E NaN NaN NaN NaN NaN NaN NaN NaN 997 11.0 76 21398 18500 2007-11-06 00:00:00 Q4 2007 6FEC340123832665047EDE0 641.31 3847.86 2541.98 1305.88 -87.06 0.00 15958.03 15958.02 0.0 1.0 0 0 0.00 527
95755 23CE3394699693090771441 169133 2007-07-13 22:50:21.167000000 C 36 Chargedoff 2008-06-23 00:00:00 0.29776 0.2900 0.2800 NaN NaN NaN NaN NaN NaN 0 NaN Administrative Assistant Self-employed 25.0 True False NaN 2007-07-13 22:16:20.437000000 640.0 659.0 1996-02-01 00:00:00 8.0 8.0 30.0 5 121.0 1.0 4.0 2.0 980.0 3.0 0.0 0.0 4336.0 0.90 419.0 24.0 0.87 0.0 NaN $0 False 0.000000 7DF73394623746619443FBA NaN NaN NaN NaN NaN NaN NaN NaN 2207 11.0 80 17169 3600 2007-07-23 00:00:00 Q3 2007 0290339358821799489FCA9 150.86 905.16 336.96 568.20 -19.21 0.00 3263.04 3263.04 0.0 1.0 0 0 0.00 45
96706 A3263396460605701A10AF3 186941 2007-08-14 08:32:51.593000000 C 36 Chargedoff 2008-02-18 00:00:00 0.29776 0.2900 0.2800 NaN NaN NaN NaN NaN NaN 0 NaN Other Not employed 23.0 True False NaN 2007-07-24 19:16:14.117000000 640.0 659.0 1997-04-18 00:00:00 7.0 5.0 23.0 3 101.0 1.0 13.0 2.0 33134.0 0.0 1.0 0.0 2795.0 0.97 186.0 18.0 0.57 0.0 NaN $0 False 0.000000 A7C33397695921848905543 NaN NaN NaN NaN NaN NaN NaN NaN 2333 6.0 79 18142 4000 2007-08-20 00:00:00 Q3 2007 077133954377005096806A6 167.62 167.62 69.10 98.52 -3.40 0.00 3930.90 3930.90 0.0 1.0 0 0 0.00 54
102994 5B7E3396974677580845CAF 180130 2007-08-01 16:32:59.720000000 A 36 Chargedoff 2010-07-12 00:00:00 0.15713 0.1500 0.1450 NaN NaN NaN NaN NaN NaN 0 NaN Other Self-employed 30.0 False False NaN 2007-08-01 16:23:51.350000000 720.0 739.0 1997-10-10 00:00:00 4.0 4.0 8.0 4 342.0 7.0 9.0 0.0 0.0 0.0 0.0 0.0 16014.0 0.40 24003.0 7.0 0.87 0.0 NaN $0 False 0.000000 1FFC33976492737514F0ED0 NaN NaN NaN NaN NaN NaN NaN NaN 1458 35.0 79 17923 6000 2007-08-13 00:00:00 Q3 2007 F0183396430092841B3FFEE 207.99 6209.70 4766.89 1442.81 -48.09 0.00 1233.12 1233.11 0.0 1.0 0 0 0.00 81
105400 ECBF34056397743716EBA7C 234391 2007-11-19 12:23:11.510000000 AA 36 Completed 2009-03-04 00:00:00 0.10692 0.1000 0.1000 NaN NaN NaN NaN NaN NaN 0 NaN Other Self-employed 46.0 True False NaN 2007-11-19 10:28:03.937000000 840.0 859.0 1993-05-25 00:00:00 8.0 8.0 18.0 3 97.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 4495.0 0.13 30005.0 16.0 1.00 0.0 NaN $0 False 0.000000 83F434055845255175F3152 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 75 22168 7000 2007-12-04 00:00:00 Q4 2007 074D3404470280145E81508 225.87 7714.32 7000.00 714.32 0.00 0.00 0.00 0.00 0.0 1.0 1 2 200.00 120
109313 ADF033719321927577746DD 54088 2006-10-27 13:52:16.307000000 A 36 Completed 2008-11-02 00:00:00 0.10656 0.0950 0.0900 NaN NaN NaN NaN NaN NaN 0 NaN Student - College Junior Not available NaN False False NaN 2006-10-25 23:01:24.100000000 720.0 739.0 2003-07-04 00:00:00 NaN NaN 24.0 17 152.0 0.0 6.0 0.0 NaN 2.0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN Not displayed False 0.000000 078B337225690482197FB9D NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 88 4427 1500 2006-11-02 00:00:00 Q4 2006 C1A03371837345157D6121D 48.05 1604.22 1500.00 104.22 -5.50 0.00 0.00 0.00 0.0 1.0 0 0 0.00 13
110488 26653378425170590ED9B6F 82305 2007-01-07 15:03:55.007000000 E 36 Completed 2007-07-03 00:00:00 0.24753 0.2400 0.2350 NaN NaN NaN NaN NaN NaN 0 NaN Homemaker Not available NaN True True 6A3B336601725506917317E 2007-01-05 08:43:34.433000000 540.0 559.0 1999-02-17 00:00:00 NaN NaN 18.0 0 0.0 1.0 5.0 0.0 NaN 5.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN Not displayed False 0.000000 0DF0337949607572274E743 NaN NaN NaN NaN NaN NaN NaN NaN 0 NaN 86 6449 5000 2007-01-17 00:00:00 Q1 2007 3EF133647645155044BFFD9 196.16 5523.85 5000.00 523.85 -10.91 0.00 0.00 0.00 0.0 1.0 0 0 0.00 45
In [41]:
#check the percentage of data that will remain if i remove the observations with similar indices in
##null_state and null_income


((data_copy.shape[0] - len(index))/data.shape[0]) * 100
Out[41]:
97.80053889430124
In [42]:
#since I will stil have 97.8% left, I want to drop the observations

data_copy = data_copy.drop(index)


# test
data_copy[columns_with_missing_data.to_list()].isna().sum()
Out[42]:
ProsperRating (Alpha)             0
BorrowerState                  4070
Occupation                     1333
EmploymentStatus                  0
CreditScoreRangeLower             0
CreditScoreRangeUpper             0
TotalCreditLinespast7years        0
DebtToIncomeRatio              8396
IncomeRange                   87303
dtype: int64
In [43]:
# I want to drop the 'IncomeRange' variable because it has 76.74% missing values

data_copy.drop(['IncomeRange'],  axis =1, inplace = True)
In [44]:
#check amount of missing data left

columns_with_missing_data = columns_with_missing_data.to_list()
columns_with_missing_data.remove('IncomeRange')
data_copy[columns_with_missing_data].isna().sum()
Out[44]:
ProsperRating (Alpha)            0
BorrowerState                 4070
Occupation                    1333
EmploymentStatus                 0
CreditScoreRangeLower            0
CreditScoreRangeUpper            0
TotalCreditLinespast7years       0
DebtToIncomeRatio             8396
dtype: int64

I want to interpolate the other missing data

In [45]:
data_copy[data_copy['Occupation'].notna()].sample(5)
Out[45]:
LoanStatus LenderYield ProsperRating (Alpha) ListingCategory (numeric) BorrowerState Occupation EmploymentStatus IsBorrowerHomeowner CurrentlyInGroup CreditScoreRangeLower CreditScoreRangeUpper TotalCreditLinespast7years DebtToIncomeRatio IncomeVerifiable LoanOriginationQuarter MonthlyLoanPayment Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors PercentFunded
17510 Completed 0.1038 A 2 MI Other Full-time True False 780.0 799.0 28.0 0.06 True Q2 493.78 0 0 0.0 398 1.0
109094 Chargedoff 0.2400 HR 0 GA Realtor Self-employed True True 540.0 559.0 17.0 0.14 True Q3 269.95 1 1 2720.0 120 1.0
11635 Current 0.1214 B 1 OH Teacher Employed False False 740.0 759.0 50.0 0.24 True Q1 45.65 0 0 0.0 1 1.0
103876 Completed 0.2099 D 2 CA Clerical Employed False False 800.0 819.0 23.0 0.13 True Q3 400.95 0 0 0.0 1 1.0
7554 Current 0.1579 B 1 MO Nurse (LPN) Employed True True 640.0 659.0 43.0 0.34 True Q4 533.22 1 0 0.0 273 1.0
In [46]:
data_copy[data_copy['Occupation'].isna()].sample(5)
Out[46]:
LoanStatus LenderYield ProsperRating (Alpha) ListingCategory (numeric) BorrowerState Occupation EmploymentStatus IsBorrowerHomeowner CurrentlyInGroup CreditScoreRangeLower CreditScoreRangeUpper TotalCreditLinespast7years DebtToIncomeRatio IncomeVerifiable LoanOriginationQuarter MonthlyLoanPayment Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors PercentFunded
42971 Current 0.1089 A 1 PA NaN Other True False 700.0 719.0 28.0 0.25 True Q4 331.62 0 0 0.0 1 1.0
4161 Current 0.2599 E 1 MI NaN Other True False 640.0 659.0 21.0 0.24 True Q4 81.64 0 0 0.0 16 1.0
32823 Current 0.1625 C 1 NY NaN Other False False 680.0 699.0 17.0 0.17 True Q1 299.85 0 0 0.0 1 1.0
52879 Current 0.2210 D 1 NY NaN Other False False 680.0 699.0 33.0 0.44 True Q4 232.57 0 0 0.0 60 1.0
53426 Current 0.1214 B 1 WV NaN Other True False 720.0 739.0 29.0 0.37 True Q4 506.42 0 0 0.0 1 1.0

I noticed observations with missing Occupation data has EmploymentStatus as Other, thus I will fill it with 'Not Stated'

In [47]:
data_copy['Occupation'] = data_copy['Occupation'].cat.add_categories('Not Stated')
data_copy['Occupation'].fillna('Not Stated', inplace = True)
In [48]:
data_copy['BorrowerState'] = data_copy['BorrowerState'].cat.add_categories('Not Stated')
data_copy['BorrowerState'].fillna('Not Stated', inplace = True)
In [49]:
data_copy[data_copy['DebtToIncomeRatio'].notna()].sample(5)
Out[49]:
LoanStatus LenderYield ProsperRating (Alpha) ListingCategory (numeric) BorrowerState Occupation EmploymentStatus IsBorrowerHomeowner CurrentlyInGroup CreditScoreRangeLower CreditScoreRangeUpper TotalCreditLinespast7years DebtToIncomeRatio IncomeVerifiable LoanOriginationQuarter MonthlyLoanPayment Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors PercentFunded
96022 Current 0.1059 A 1 PA Construction Employed True False 700.0 719.0 36.0 0.27 True Q1 330.57 0 0 0.0 1 1.0
12086 Defaulted 0.2600 C 0 GA Professional Full-time False True 640.0 659.0 44.0 0.30 True Q3 335.25 0 0 0.0 82 1.0
113575 Past Due (1-15 days) 0.3099 E 1 NC Other Other True False 660.0 679.0 51.0 0.53 True Q3 326.62 0 0 0.0 85 1.0
34705 Current 0.0669 AA 1 IN Construction Employed True False 700.0 719.0 21.0 0.12 True Q1 358.73 0 0 0.0 134 1.0
7781 Chargedoff 0.3099 E 7 CO Sales - Retail Employed False False 680.0 699.0 11.0 0.11 True Q3 152.42 0 0 0.0 26 1.0
In [50]:
data_copy[data_copy['DebtToIncomeRatio'].isna()].sample(5)
Out[50]:
LoanStatus LenderYield ProsperRating (Alpha) ListingCategory (numeric) BorrowerState Occupation EmploymentStatus IsBorrowerHomeowner CurrentlyInGroup CreditScoreRangeLower CreditScoreRangeUpper TotalCreditLinespast7years DebtToIncomeRatio IncomeVerifiable LoanOriginationQuarter MonthlyLoanPayment Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors PercentFunded
69316 Current 0.1315 B 1 CA Other Self-employed True False 680.0 699.0 23.0 NaN False Q1 342.51 0 0 0.0 1 1.0
83135 Current 0.1255 B 1 UT Truck Driver Self-employed True False 720.0 739.0 37.0 NaN False Q1 509.39 0 0 0.0 1 1.0
87073 Current 0.1730 B 3 CO Computer Programmer Self-employed False False 660.0 679.0 15.0 NaN False Q3 293.90 0 0 0.0 127 1.0
9802 Current 0.2319 D 7 SC Administrative Assistant Employed False False 700.0 719.0 17.0 NaN False Q1 78.67 0 0 0.0 45 1.0
51028 Current 0.1419 B 1 IL Other Other True False 740.0 759.0 33.0 NaN False Q2 238.90 0 0 0.0 74 1.0
In [51]:
#perform linear interpolation on DebtToIncomeRatio

data_copy['DebtToIncomeRatio'] = data_copy['DebtToIncomeRatio'].interpolate()
data_copy['DebtToIncomeRatio'].isna().sum()
Out[51]:
0
In [52]:
#check for null values

data_copy.isna().sum()
Out[52]:
LoanStatus                     0
LenderYield                    0
ProsperRating (Alpha)          0
ListingCategory (numeric)      0
BorrowerState                  0
Occupation                     0
EmploymentStatus               0
IsBorrowerHomeowner            0
CurrentlyInGroup               0
CreditScoreRangeLower          0
CreditScoreRangeUpper          0
TotalCreditLinespast7years     0
DebtToIncomeRatio              0
IncomeVerifiable               0
LoanOriginationQuarter         0
MonthlyLoanPayment             0
Recommendations                0
InvestmentFromFriendsCount     0
InvestmentFromFriendsAmount    0
Investors                      0
PercentFunded                  0
dtype: int64
In [53]:
#check percentage of data left after cleaning

print('percentage of data left after cleaning is {:0.02f}%'.format(data_copy.shape[0]/data.shape[0] * 100))
percentage of data left after cleaning is 97.80%

from the data description the column 'ListingCategory (numeric)' is encoded, I want to replace the values

In [54]:
data_copy['ListingCategory (numeric)'] = data_copy['ListingCategory (numeric)'].replace(\
{0: 'Not AVilable', 1: 'Debt Consolidation', 2: 'HOme Improvement',3: 'Business', 4: 'Personal Loan',\
 5: 'Student Use', 6: 'Auto', 7: 'Other',  8: 'Baby & Adoption', 9: 'Boat', 10: 'Cosmetic Procedures',\
 11: 'Engagement Ring', 12: 'Green Loans', 13: 'Household Expenses', 14: 'Large Purchases',\
 15: 'Medical/ Dental', 16: 'Motorcycle', 17: 'RV', 18: 'Taxes', 19: 'Vacation', 20: 'Wedding Loans'})
data_copy['ListingCategory (numeric)'] = data_copy['ListingCategory (numeric)'].astype('category')

#test
print(data_copy['ListingCategory (numeric)'].dtype)
data_copy['ListingCategory (numeric)'].value_counts()
category
Out[54]:
Debt Consolidation     58242
Not AVilable           14596
Other                  10469
HOme Improvement        7421
Business                7172
Auto                    2564
Personal Loan           2392
Household Expenses      1996
Medical/ Dental         1522
Taxes                    885
Large Purchases          876
Wedding Loans            771
Vacation                 768
Student Use              750
Motorcycle               304
Engagement Ring          217
Baby & Adoption          199
Cosmetic Procedures       91
Boat                      85
Green Loans               59
RV                        52
Name: ListingCategory (numeric), dtype: int64
In [55]:
#create a column to give investments not from friends
data_copy['investments not from friends'] = data_copy['Investors'] -\
data_copy['InvestmentFromFriendsCount']
In [56]:
data_copy.rename(columns = {'ListingCategory (numeric)': 'ListingCategory'}, inplace = True)
data_copy.columns
Out[56]:
Index(['LoanStatus', 'LenderYield', 'ProsperRating (Alpha)', 'ListingCategory',
       'BorrowerState', 'Occupation', 'EmploymentStatus',
       'IsBorrowerHomeowner', 'CurrentlyInGroup', 'CreditScoreRangeLower',
       'CreditScoreRangeUpper', 'TotalCreditLinespast7years',
       'DebtToIncomeRatio', 'IncomeVerifiable', 'LoanOriginationQuarter',
       'MonthlyLoanPayment', 'Recommendations', 'InvestmentFromFriendsCount',
       'InvestmentFromFriendsAmount', 'Investors', 'PercentFunded',
       'investments not from friends'],
      dtype='object')
In [57]:
data_copy.to_csv('data/cleaned_Loan_prosper_data.csv')

What is the structure of your dataset?

The dataset is a 2-dimension data with 113937 loan records and 81 features.61.7% of the data is float, 21.0% is categorical, 13.6% is integer and 3.7% is boolean.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is the PercentFunded. I want to see the factors that will affect 100% full funding

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

I will check the features: LoanStatuslenderYield prosper rating (Alpha)ListingCategory, BorrowerState, Occupation, EmploymentStatus, IsBorrowerHomeOwner, CurrentlyInGroup, CreditScoreRangeLower, CreditScoreRangeUpper, TotalCreditLinesInPast7Years, DebtToIncomeRatio, IncomeRange, IncomeVerifiable, LopanOriginationQuarter, MonthlyLoanPayment, Recommendations, InvestmentfromFriendsCounts, InvestmentfromFriendAmount, Investors to see how they correlate with percentfunded and to see if they can determine a 100% full funding. I will drop other columns because they are unique to each observation and are repition of some other columns.

Univariate Exploration

In [58]:
data_copy.dtypes.to_frame().T
Out[58]:
LoanStatus LenderYield ProsperRating (Alpha) ListingCategory BorrowerState Occupation EmploymentStatus IsBorrowerHomeowner CurrentlyInGroup CreditScoreRangeLower CreditScoreRangeUpper TotalCreditLinespast7years DebtToIncomeRatio IncomeVerifiable LoanOriginationQuarter MonthlyLoanPayment Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors PercentFunded investments not from friends
0 category float64 category category category category category bool bool float64 float64 float64 float64 bool category float64 int64 int64 float64 int64 float64 int64
In [59]:
data_copy.sample(5)
Out[59]:
LoanStatus LenderYield ProsperRating (Alpha) ListingCategory BorrowerState Occupation EmploymentStatus IsBorrowerHomeowner CurrentlyInGroup CreditScoreRangeLower CreditScoreRangeUpper TotalCreditLinespast7years DebtToIncomeRatio IncomeVerifiable LoanOriginationQuarter MonthlyLoanPayment Recommendations InvestmentFromFriendsCount InvestmentFromFriendsAmount Investors PercentFunded investments not from friends
96070 Chargedoff 0.2900 D Debt Consolidation WI Food Service Full-time True False 640.0 659.0 21.0 0.23 True Q4 50.94 0 0 0.0 12 1.0 12
32576 Current 0.1319 B Debt Consolidation SD Computer Programmer Employed True False 680.0 699.0 46.0 0.34 True Q2 350.50 0 0 0.0 184 1.0 184
18689 Current 0.1740 C Debt Consolidation CA Administrative Assistant Employed True False 680.0 699.0 40.0 0.31 True Q4 545.30 0 0 0.0 1 1.0 1
43022 Completed 0.2499 D Auto NY Other Employed False False 680.0 699.0 9.0 0.07 True Q4 201.43 0 0 0.0 74 1.0 74
94640 Completed 0.0799 A Debt Consolidation OR Accountant/CPA Employed True False 680.0 699.0 34.0 0.18 True Q1 461.03 0 0 0.0 257 1.0 257
In [60]:
data_copy.columns
Out[60]:
Index(['LoanStatus', 'LenderYield', 'ProsperRating (Alpha)', 'ListingCategory',
       'BorrowerState', 'Occupation', 'EmploymentStatus',
       'IsBorrowerHomeowner', 'CurrentlyInGroup', 'CreditScoreRangeLower',
       'CreditScoreRangeUpper', 'TotalCreditLinespast7years',
       'DebtToIncomeRatio', 'IncomeVerifiable', 'LoanOriginationQuarter',
       'MonthlyLoanPayment', 'Recommendations', 'InvestmentFromFriendsCount',
       'InvestmentFromFriendsAmount', 'Investors', 'PercentFunded',
       'investments not from friends'],
      dtype='object')

What is the distribution of variable of interest- Percent funded ?

In [61]:
plt.figure(figsize = (10,4))

plt.subplot(121)
sb.kdeplot(data = data_copy['PercentFunded'], color = 'brown')
plt.title('Distribution of all percentfunded', fontdict = {'fontsize': 18, 'color': 'brown'}, pad = 0.4)
plt.xlabel('Percent Funded', fontdict = {'weight': 'bold'})
plt.ylabel('counts', fontdict = {'weight': 'bold'})
plt.ylim(-4);


#Because the proportion of 1.0 is very large, I want to view the distribution of other percent funded
# in another subplot
plt.subplot(122)
not_including_1 = data_copy[data_copy['PercentFunded'] < 1]
bins = np.arange(not_including_1['PercentFunded'].min(), not_including_1['PercentFunded'].max() + 0.025, 0.025)
sb.histplot(data =  not_including_1, x = 'PercentFunded', bins = bins, color = 'teal')
plt.title('Distribution of percentfunded excluding 1.0', fontdict = {'fontsize': 18, 'color': 'teal'}, pad = 0.4)
plt.xlim(0.7,0.99,0.25)
plt.xlabel('Percent Funded', fontdict = {'weight': 'bold'})
plt.ylabel('Density', fontdict = {'weight': 'bold'})

plt.subplots_adjust(wspace = 1.0)
In [62]:
#I want to check the percentage of loans that has PercentFunded of 1

(data_copy['PercentFunded'].value_counts().iloc[0]/ data_copy['PercentFunded'].shape[0]) * 100
Out[62]:
99.22014520196355

The `PercentFunded` feature is highly skewed with 99.18% of the data being 1.0, thus performing transformation on the feature had no effect on the skewness. But from the observation, 99.18 percent of loans were fully funded.
Of the 0.82% left, most of the loans were 70% funded

What are the distribution of all categorical data?

In [63]:
def count(data, onx, sides_to_remove, title, fig_width, fig_height, rotation,\
          sort_as_ordinal = None, rotatex = None, subplot = None):
    
    '''To plot distributions of categorical data in form of counts
    
    data: str, The elements of the categorical data,
    onx: bool, If the categorical data should be on the x-axis,
    sides_to_remove: list, The borders of the plot to remove,
    title: str, The title of the plot,
    fig_width: int, The width of the plot figure, 
    fig_height: int, The height of the plot figure, 
    rotation: int, text rotation,
    sort_as_ordinal: bool,If the categorical elements should be sorted in order, 
    rotatex: bool, If x tick should be rotated, 
    subplot: int, geometry of subplots'''
    
    
    value = data_copy[data].value_counts().index.to_list() if sort_as_ordinal == None else\
     data_copy[data].value_counts().sort_index().index.to_list()
    fig = plt.figure(figsize = (fig_width, fig_height))
    ax1 = plt.subplot(111) if subplot == None else plt.subplot(subplot)
    ax1.set_title(title, fontdict = {'color':'teal', 'fontsize':18,\
                                                    'fontweight': 'bold'}, pad = 10)
    if onx:
        sb.countplot(x = data, data = data_copy, order = value, color = 'teal', ax = ax1)
        for i in range(len(value)):
            count = data_copy[data].value_counts()[value[i]]
            percent = (count/data_copy[data].shape[0]) * 100
            plt.text(x = i, y = count , s = '{:0.2f}% value:({c})'.format(percent, c = count), rotation = 45 if\
                     rotation else 0)
        plt.xlabel(f'{data}', fontdict = {'weight': 'bold'})
        plt.ylabel(f'count of {data}', fontdict = {'weight': 'bold'})
        plt.xticks(rotation = 0 if rotatex == True else 90);
    else:
        sb.countplot(y = data, data = data_copy, order = value, color = 'teal', ax = ax1)
        plt.ylabel(f'{data}', fontdict = {'weight': 'bold'})
        plt.xlabel(f'count of {data}', fontdict = {'weight': 'bold'})
        for i in range(len(value)):
            count = data_copy[data].value_counts()[value[i]]
            percent = (count/data_copy[data].shape[0]) * 100
            plt.text(y = i, x = count , s = '{:0.2f}% value:({c})'.format(percent, c = count), rotation = 45 if\
                     rotation else 0)
    for i in sides_to_remove:
        ax1.spines[i].set_visible(False)
    
In [64]:
sides = ['top', 'right']
count(data = 'ProsperRating (Alpha)', onx = True, sides_to_remove = sides,\
      title ='Count of ProsperRating (Alpha)', fig_width = 10, fig_height = 6, rotation = True)

ProsperRating (Alpha) of C occurs the most and Nc occurs the least

In [65]:
sides = ['top', 'right']
count(data = 'ListingCategory', onx = True, sides_to_remove = sides,\
      title ='Count of ListingCategory', fig_width = 10, fig_height = 4, rotation = True)

Most borrowers had taken a loan for debt Consolidation.

In [66]:
sides = ['top','right']
count(data = 'BorrowerState', onx = True, sides_to_remove = sides, title ='Count of state',\
     fig_width = 15, fig_height = 8, rotation = True)

Most borrowers are from California while the least are from North Dakota and about 3.65% of the loan observations have the states not stated

In [67]:
sides = ['right', 'top']
count(data = 'LoanStatus', onx = False, sides_to_remove = sides, title ='Count of loan status',\
     fig_width = 8, fig_height = 6, rotation = False)

Most of the loan staus are Current, completed

In [68]:
sides = ['right', 'top']
count(data = 'EmploymentStatus', onx = False, sides_to_remove = sides, title ='Count of employment status',\
     fig_width = 8, fig_height = 6, rotation = False)

Most borrowers are Employed, with very few retired and not employed
also more borrowers work full time than part-time

In [69]:
sides = ['right', 'top']
count(data = 'Occupation', onx = False, sides_to_remove = sides, title ='Count of Occupation',\
     fig_width = 15, fig_height = 15, rotation = False)

Most borrowers' Occupation is not specified

In [70]:
count(data = 'LoanOriginationQuarter', onx = True, sides_to_remove = ['top'], title ='Count of Quarter',\
     fig_width = 8, fig_height = 6, rotation = True, sort_as_ordinal = True, rotatex = True)

The amount of loans originated in each quarter are almost similar, but Q4 had the highest

In [71]:
count('Recommendations', onx = False, sides_to_remove = ['right', 'top'], title = \
      'Distribution of Recommendations', fig_width = 10, fig_height =8, rotation = False, \
      sort_as_ordinal = True, rotatex = None, subplot = 211)
count('InvestmentFromFriendsCount', onx = False, sides_to_remove = ['right', 'top'], title = \
      'Distribution of Investment From Friends Count', fig_width = 10, fig_height =8,
      rotation = False, sort_as_ordinal = True, rotatex = None, subplot = 212)

Most borrowers had no Investment from friends and no recommendations

In [72]:
f, ax = plt.subplots(2, 2, figsize = (10,10))
color = ['brown', 'teal']
label = [ 'False','True']

def proportion_pie(data, axis_row, axis_column, title, explode, rotation1, rotate_index, wedgeprops = None,):
    
    '''To plot distributions of categorical data in form of proportion
    
    data: str, The elements of the categorical data,
    axis_row: int, row position to plot on,
    axis_column: int, column position to plot on,
    title: str, title of plot,
    explode: list, explode values  to pass to pie plot,
    rotation1: bool, If all texts should be rotated,
    rotate_index: int, The index of the text to rotate if rotation1 is false,
    wedgeprops: dict, The width of each proportion, to help form a donut chart
    '''
    
    data = data_copy[data].value_counts().sort_index()
    axes = ax[axis_row][axis_column]
    axes.set_title(title, backgroundcolor = 'black',\
         fontdict = {'fontweight': 'bold', 'color': 'white'}, pad = 71)
    explode = explode
    if wedgeprops != None:
        patches, texts, autotexts = axes.pie(data, colors = color, autopct = '%0.2f%%', startangle = 87,
        explode = explode, wedgeprops = wedgeprops,\
            textprops = {'color': 'white', 'fontweight' : 'bold'},  radius = 2);
    else:
        patches, texts, autotexts = axes.pie(data, colors = color, autopct = '%0.2f%%', startangle = 87,
        explode = explode, textprops = {'color': 'white', 'fontweight' : 'bold'},  radius = 2);
    if rotation1:
        [text.set_rotation(90) for text in autotexts] 
    else:
        autotexts[rotate_index].set_rotation(75)
    legend_properties = {'weight': 'bold'}
    f.legend(label, loc = 4 , facecolor = 'grey',borderpad = 4, labelspacing = 4,\
            fontsize = 40, prop = legend_properties, labelcolor = 'white',\
            bbox_to_anchor = (0, 0.5));
    plt.subplots_adjust(wspace = 0.9, hspace = 0.9)
    ax[1][1].set_visible(False)
    return axes
proportion_pie('IsBorrowerHomeowner', 0, 0, 'proportion of Borrowers that are home owners', [0.07, 0],\
   True,{'width': 1.0})

proportion_pie('CurrentlyInGroup', 0, 1, 'proportion of Borrowers that are in a group', [0, 0.07],\
              False, 1)

proportion_pie('IncomeVerifiable', 1, 0, 'proportion of verifiable Income', [0, 0.07],\
              False, 0)

plt.show()

  1. The proportion difference between non home owners and home owners is small, but still more borrowers are home owners
  2. Very few borrowers(only 10.95%) are in groups
  3. Most borrowers had verifiable incomes about 92.30%

Visualize the Quantitative data

In [73]:
def score_money_counts(fig_width, fig_height, data, title, subplot = None):
    
    '''To plot distributions of numerical data in form of frequency
    
    fig_width: int, The width of the plot figure, 
    fig_height: int, The height of the plot figure, 
    data: str, The elements of the numerical data,
    title: str, The title of the plot,
    subplot: int, geometry of subplots'''
    
    plt.figure(figsize = (fig_width, fig_height))
    ax1 = plt.subplot(111) if subplot == None else plt.subplot(subplot)
    scores = data_copy[data].value_counts().sort_index()
    index = [str(index) for index in scores.index]
    sb.barplot(x = scores.values, y = index, color = 'brown')
    plt.title(title, fontdict = \
              {'color': 'brown', 'fontweight': 'bold'})
    for i in range(len(index)):
        percent = (scores.iloc[i]/data_copy['CreditScoreRangeLower'].shape[0]) * 100
        plt.text(x = scores.iloc[i], y = i, s = '{:0.2f}%'.format(percent))
    ax1.spines['right'].set_visible(False)
    plt.ylabel(f'{data}', fontdict = {'weight': 'bold'})
    plt.xlabel(f'count of {data}', fontdict = {'weight': 'bold'})
In [74]:
score_money_counts(8, 8, 'CreditScoreRangeLower', 'distribution of lower range of credit scores')

The highest amount of Lower credit score ranges from 640 to 720

In [75]:
score_money_counts(8, 8, 'CreditScoreRangeUpper', 'distribution of Upper range of credit scores')

The highest amount of Lower credit score ranges from 659 to 739

In [76]:
def money_hist(fig_width, fig_height, data, number_of_bins, title, more_than_a_subplot = False, \
               use_another_data = False, another_data = None,):
    
    '''To plot distributions of continuous numerical data in form of frequency
    
    fig_width: int, The width of the plot figure, 
    fig_height: int, The height of the plot figure, 
    data: str, The elements of the numerical data,
    number_of_bins: int, number of bins,
    title: str, The title of the plot,
    more_than_a_subplot: bool, If there should be more than a plot,
    use_another_data: bool, If another data should be used in the second subplot,
    another_data: str, The data to use if use_another_data is True'''
    
    plt.figure(figsize = (fig_width, fig_height))
    ax = plt.subplot(121)
    data_to_use = data_copy
    bin_number = data_to_use[data[0]].max() / number_of_bins
    bins = np.arange(data_to_use[data[0]].min(), data_to_use[data[0]].max() +bin_number, bin_number)
    sb.histplot(x = data[0], data = data_to_use, ax = ax, bins = bins)
    ax.spines['top'].set_visible(False)
    plt.title(title if isinstance(title, str) else title[0], 
              fontdict = {'fontsize': 40})
    plt.xlabel(f'{data[0]}', fontdict = {'weight': 'bold'})
    plt.ylabel(f'frequency of {data[0]}', fontdict = {'weight': 'bold'})
    
    if more_than_a_subplot:
        ax = plt.subplot(122)
        if use_another_data:
            data_to_use = another_data
        else:
            data_to_use = data_copy
        bin_number = data_to_use[data[1]].max() / number_of_bins
        bins = np.arange(data_to_use[data[1]].min(), data_to_use[data[1]].max() +bin_number, bin_number)
        sb.histplot(x = data[1], data = data_to_use, ax = ax, bins = bins)
        ax.spines['top'].set_visible(False)
        plt.title(title[1], fontdict = {'fontsize': 40})
        plt.xlabel(f'{data[1]}', fontdict = {'weight': 'bold'})
        plt.ylabel(f'frequency of {data[1]}', fontdict = {'weight': 'bold'})
    plt.subplots_adjust(wspace = 0.9)
    plt.tight_layout()
In [77]:
def log_money_hist(fig_width, fig_height, data, number_of_bins, title, more_than_a_subplot = False, \
               use_another_data = False, another_data = None, lim_tick1 = None, lim_tick2 = None):
    
    '''To plot log distributions of continuous numerical data in form of frequency
    
    fig_width: int, The width of the plot figure, 
    fig_height: int, The height of the plot figure, 
    data: str, The elements of the numerical data,
    number_of_bins: int, number of bins,
    title: str, The title of the plot,
    more_than_a_subplot: bool, If there should be more than a plot,
    use_another_data: bool, If another data should be used in the second subplot,
    another_data: str, The data to use if use_another_data is True
    lim_tick1: list, The limit and tick of the first subplot,
    lim_tick2: list, The limit and tick of the second subplot'''
        
    plt.figure(figsize = (fig_width, fig_height))
    def hist_plot(axes, data_index, data_selected, limit = None, tick = None):
        ax = plt.subplot(axes) 
        bin_number = data_copy[data[data_index]].max() / 10 ** np.log(number_of_bins)
        bins = 10 ** np.arange(np.log(data_copy[data[data_index]] + 1).min(), \
                               np.log(data_copy[data[data_index]] + 1).max() +bin_number, bin_number)
        sb.histplot(x = data[data_index], data = data_selected, ax = ax, bins = bins)
        ax.spines['top'].set_visible(False)
        plt.xscale('log')
        plt.title(title if isinstance(title, str) else title[data_index],\
                  fontdict = {'fontsize': 40})
        plt.xlabel(f'{data[data_index]}', fontdict = {'weight': 'bold'})
        plt.ylabel(f'frequency of {data[data_index]}', fontdict = {'weight': 'bold'})
        if limit != None:
            plt.xlim(limit)
        if tick != None:
            plt.xticks(tick[0], tick[1])
    if lim_tick1 == None:    
        hist_plot(121, 0, data_copy)
    else:
        hist_plot(121, 0, data_copy, limit = lim_tick1[0], tick = lim_tick1[1])
    if more_than_a_subplot:
        if use_another_data: # truth value ambiguous if I use argument to take dataframe in condition
            data_to_use = another_data
        else:
            data_to_use = data_copy
        if lim_tick2 == None:
            hist_plot(122, 1, data_to_use)
        else:
            hist_plot(122, 1, data_copy, limit = lim_tick2[0], tick = lim_tick2[1])
In [78]:
money_hist(fig_width = 20, fig_height = 5, data = ['LenderYield'],\
           number_of_bins = 27, title = 'Distribution of LenderYield',\
                         )

The `LenderYield` is a unimodal distribution with the highest close to 0.15.
It seems there are outliers less than 0.05

In [79]:
len(data_copy[data_copy['LenderYield'] < 0.05])
Out[79]:
416

Because they are 390 in number that are less than 0,05, I think they are few cases and not outliers

In [80]:
money_hist(fig_width = 20, fig_height = 5, data = ['MonthlyLoanPayment'],\
           number_of_bins = 27, title = 'Distribution of MonthlyLoanPayment',\
                         )

MonthlyLoanPayment plotted on a standard scale is skewed.

In [81]:
log_money_hist(fig_width = 20, fig_height = 8, data = ['MonthlyLoanPayment'], number_of_bins = 70,\
               title = 'Logarithm Distribution of MonthlyLoanPayment', lim_tick1 = [(1e1, 1e4),[[1e1, 1.5e1,5e1, 1e2, 1.75e2,2.5e2,5e2, 1e3, 2.5e3,5e3],\
          ['10', '15','50', '100', '175','250','500', '1000', '2500','5000']]])

On the logarithm scale the mode is between 100 and 175

In [82]:
money_hist(fig_width = 20, fig_height = 5, data = ['TotalCreditLinespast7years'],\
           number_of_bins = 27, title = 'Distribution of Total Credit Lines in past 7 years',\
                         )

Total credit lines between 20 and 30 are more but the distribution is skewed a little.

In [83]:
log_money_hist(fig_width = 20, fig_height = 8, data = ['TotalCreditLinespast7years'], number_of_bins = 30,\
               title = 'Logarithm Distribution of TotalCreditLinespast7years', lim_tick1 = \
              [(1e1, 2.5e2), [[5e0-1, 1e1-1, 1.5e1-1, 2e1-1,2.5e1-1, 3.5e1-1, 5e1-1, 1e2-1],\
                              ['5','10', '15', '20','25', '35', '50', '100']]])

The logarithmic distribution of total credit lines is roughly bimodal, One mode between 20 and 25 and the other between 35 and 50. There is a steep jump before 15

In [84]:
money_hist(fig_width = 20, fig_height = 5, data =['DebtToIncomeRatio'], number_of_bins = 50,\
          title = 'Distribution of Debt To Income Ratio')

The distribution of Debt to income ratio is skewed and it seems there is an outlier value close to 10

In [85]:
len(data_copy[data_copy['DebtToIncomeRatio'] > 10])
Out[85]:
266

208 numbers of oservation highern than 10 indicate they are not outliers

In [86]:
log_money_hist(fig_width = 20, fig_height = 8, data = ['DebtToIncomeRatio'], number_of_bins = 10,\
               title = 'Logarithm Distribution of Debt To Income Ratio')
plt.xlim(0.9e0, 1.5e1);
plt.xticks([1e0, 2e0, 3e0, 4e0, 5e0, 6e0, 7e0, 8e0, 9e0, 1e1, 1.2e1,1.5e1],\
           ['1','2','3','4','5','6','7','8','9','10', '12','15']);

The logarithm distribution of Debt-to-Income ratio is unimodal with even ratios close to 10 being the highest

In [87]:
money_hist(fig_width = 50, fig_height = 15, data = ['Investors', 'investments not from friends']\
               ,number_of_bins = 20, title = ['Distribution of Investors', \
                        'Distribution of Investments not from friends'],\
               more_than_a_subplot = True)

The distribution from Investors and Investors not friends is highly skewed

In [88]:
log_money_hist(fig_width = 50, fig_height = 10, data = ['Investors', 'investments not from friends']\
               ,number_of_bins = 50, title = ['Logarithm Distribution of Investors', \
                        'Logarithm Distribution of Investments not from friends'],\
               more_than_a_subplot = True, lim_tick1 = [(1e0, 1.5e3),
                                                       [[1e0, 5e0, 1e1, 5e1,1e2, 5e2, 1e3],
                                                        ['1','5','10', '50', '100', '500', '1000']]],
              lim_tick2 = [(0.9e0, 1.5e3),[[1e0, 2e0,5e0, 1e1, 5e1,1e2, 5e2, 1e3],
                        ['1', '2','5','10', '50', '100', '500', '1000']]])

The Investors data on the logarithmic scale has a normal distribution with the mode being between 40 and 200.
The Investment not from friends data has the mode between 1 and 1.5 with the other figures forming a roughly normal distribution with a little skew to the left

In [89]:
newdata = data_copy[data_copy['InvestmentFromFriendsAmount'] > 1000]
money_hist(fig_width = 37, fig_height = 15, data = ['InvestmentFromFriendsAmount', 'InvestmentFromFriendsAmount'],\
           number_of_bins = 30, title = ['Distribution of InvestmentFromFriendsAmount',\
'Distribution of Investment From Friends Amount greater than 1000'],
           use_another_data = True,another_data = newdata, more_than_a_subplot = True)

The data on investments amount from friends ranging between 0 and 1000 occurs far more than other amounts so I made another plot to check distribution of the data part greater than 1000

In [90]:
log_money_hist(fig_width = 20, fig_height = 10, data = ['InvestmentFromFriendsAmount'],\
           number_of_bins = 150, title = 'Logarithm Distribution of InvestmentFromFriendsAmount',
               lim_tick1 = [(0.9e1, 5e4),[[1e1, 5e1,1e2, 5e2, 1e3, 5e3, 1e4, 5e4, 1e5],
                                    ['10', '50', '100', '500', '1k', '5k', '10k', '50k', '100k']]],)

The distribution of investments amount from friends on the logarithmic scale is roughly bimodal with the mode covering values between 100 and 150 and the other between 30 and 45

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

Most of the values, more than 90% in the variable were 1, so the first visualisation was highly skewed and the frequency of other values could not be seen. I tried to transform the scale but because of the very large proportion for PercentFunded value of 1, It could not solve the problem so I divided the data into two:

  • Loans that were not fully funded(PercentFunded < 1)
  • Loans that were fully funded (PercentFunded > 1)

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

  • The monthly loan payment was skewed and It almost seemed there was an outlier between values 1500 and 2000, so I performed log transformations on the data to understand the distribution better
  • The Debt-to-income-ratio was skewed and It almost seemed there was an outlier at value 10, so I performed log transformations on the data to understand the distribution better
  • The Investors, Investments not from friends were skewed, so I performed log transformations on the data to understand the distribution better
  • The InvestmentFromFriendsAmount was highly skewed, I plotted for monthly payment > 1000 to see the distribution of monthly payment > 1000 and I applied log transformation on the values

Bivariate Exploration

I want to check the correlation of all variables with percent funded
I want to also check the relationship of other variables

In [91]:
#check for columns that have high positive correlation with PercentFunded

data_copy.corr()[data_copy.corr()['PercentFunded'] > 0.4]['PercentFunded']
Out[91]:
PercentFunded    1.0
Name: PercentFunded, dtype: float64
In [92]:
#check for columns that have high negative correlation with PercentFunded

data_copy.corr()[data_copy.corr()['PercentFunded'] < -0.4]['PercentFunded']
Out[92]:
Series([], Name: PercentFunded, dtype: float64)

There is no strong correlation between percentfunded and other variables

In [93]:
#check for columns that have high positive correlation

a = data_copy.corr()[data_copy.corr() > 0.4]
sb.heatmap(a)
plt.title('Positive correlation between features greater than 0.4', fontdict = {'weight': 'bold'})
Out[93]:
Text(0.5, 1.0, 'Positive correlation between features greater than 0.4')

There is reasonable correlation between

* Recommendation and InvestorsFromFriendsCount
* InvestorsFromFriendsCount and InvestorsFromFriendsAmount which is expected
In [94]:
#check for columns that have high negative correlation

a = data_copy.corr()[data_copy.corr() < -0.4]
sb.heatmap(a)

plt.title('negative correlation between features less than 0.4', fontdict = {'weight': 'bold'})
Out[94]:
Text(0.5, 1.0, 'negative correlation between features less than 0.4')

There is reasonable correlation between LenderYield and

* CreditScoreRangeUpper
* CreditScoreRangeLower
In [95]:
plt.figure(figsize = (15,20))
sb.pairplot(data = data_copy, y_vars = 'PercentFunded')
plt.title('Pairplot of other features against Percent Funded to confirm', fontdict ={'weight': 'bold'})
Out[95]:
Text(0.5, 1.0, 'Pairplot of other features against Percent Funded to confirm')
<Figure size 1080x1440 with 0 Axes>

This plot confirms percentfunded has no correlation with other columns

In [96]:
def boxplots(fig_width, fig_height,  cat_data, num_data, data, title,
            multiple_subplots = False, use_another_data = False, another_data = None):
    
    '''To plot relationship between categorical data and numeric data
    
    fig_width: int, The width of the plot figure, 
    fig_height: int, The height of the plot figure, 
    cat_data: list, The elements of the categorical data,
    num_data: list, The elements of the numerical data,
    data: DataFrame, The data to plot from,
    title: list, The title of the plot,
    multiple_subplots: bool, If there should be more than a plot,
    use_another_data: bool, If another data should be used in the second subplot,
    another_data: str, The data to use if use_another_data is True
    '''
    
    plt.figure(figsize = (fig_width, fig_height))
    plt.subplot(211)
    sb.boxplot(x = cat_data[0], y = num_data[0], data = data, color = 'teal')
    plt.xticks(rotation = 90)
    plt.title(title[0], fontdict = {'fontweight': 'bold', 'fontsize' : 24})
    
    if multiple_subplots:
        plt.subplot(212)
        if use_another_data:
            data_to_use = another_data
        else:
            data_to_use = data
        sb.boxplot(x = cat_data[0], y = num_data[0], data = data_to_use, color = 'teal')
        plt.title(title[1], fontdict = {'fontweight': 'bold', 'fontsize' : 24})
        plt.xticks(rotation = 90);
        plt.subplots_adjust(hspace = 1.4)
    plt.xlabel(f'{cat_data[0]}', fontdict = {'weight': 'bold'})
    plt.ylabel(f'{num_data[0]}', fontdict = {'weight': 'bold'})
In [97]:
def full_funding(cat_data, title, fig_size = None):
    
    '''To plot distributions of categorical data in form of counts for fully funded loans
    
    
    cat_data: str, The elements of the categorical data,
    title: str, The title of the plot,
    fig_size: tuple, The size of the figure to plot on
    '''
    if fig_size != None:
        plt.figure(figsize = fig_size)
    full_funding = data_copy[data_copy['PercentFunded'] == 1]
    sb.countplot(x = cat_data, data = full_funding, color = 'brown')
    plt.xticks(rotation = 90);
    plt.title(f'count of {title} that had full percent funding', 
             fontdict = {'fontweight': 'bold', 'fontsize' : 24});
    plt.xlabel(f'{cat_data}', fontdict = {'weight': 'bold'})
    plt.ylabel(f'Count of {cat_data[0]} for loans with full funding', fontdict = {'weight': 'bold'})
In [98]:
not_including_1 = data_copy[data_copy['PercentFunded'] < 1]
boxplots(10, 10, ['LoanStatus'], ['PercentFunded'], data_copy,
         ['Relationship between percentFunded and Loanstatus', 
          'Relationship between percentFunded and Loanstatus excluding 1'],
          multiple_subplots = True, \
         use_another_data = True, another_data = not_including_1)

Loans that are not fully funded with Loan status of past Due(31-60 days) have larger range of PercentFunded than other status, with most having no skew distribution

In [99]:
full_funding('LoanStatus', 'loan status')
In [100]:
not_including_1 = data_copy[data_copy['PercentFunded'] < 1]
boxplots(10, 10, ['ProsperRating (Alpha)'], ['PercentFunded'], data_copy,
         ['Relationship between percentFunded and ProsperRating', 
          'Relationship between percentFunded and ProsperRating excluding 1'],
          multiple_subplots = True, \
         use_another_data = True, another_data = not_including_1)
In [ ]:
 

Rating C has the highest range of percenrt funded not fully funded and HR has the lowest range and 4 have no skewness

In [101]:
full_funding('ProsperRating (Alpha)', 'ProsperRating')

For fully funded loans, Most Loans had rating 'C' and fewer had 'AA'

In [102]:
not_including_1 = data_copy[data_copy['PercentFunded'] < 1]
boxplots(10, 10, ['ListingCategory'], ['PercentFunded'], data_copy,
         ['Relationship between percentFunded and ListingCategory', 
          'Relationship between percentFunded and ListingCategory excluding 1'],
          multiple_subplots = True, \
         use_another_data = True, another_data = not_including_1)

For not fully funded loans, Loans for HomeImprovement, Business, Debt Consolidation, Household Expenses, Medical/Dental and other had largest range of Percent Funded, while Loans for baby and Adoption, Cosmetic Procedures, Green Loans,Not available had the leas range with their values being around 81%, 85%, 71% and 83% respectively

In [103]:
full_funding('ListingCategory', 'Listing Category', (10,4))

For loans filly funded, Loans for Debt Consolidation had the highest frequency.

In [104]:
not_including_1 = data_copy[data_copy['PercentFunded'] < 1]
boxplots(20, 10, ['BorrowerState'], ['PercentFunded'], data_copy,
         ['Relationship between percentFunded and Borrower State', 
          'Relationship between percentFunded and Borrower State excluding 1'],
          multiple_subplots = True, \
         use_another_data = True, another_data = not_including_1)

For not fully loans, Maryland, North Carolina, New Jersey,New York, Washington had the largest range of Percent funded

In [105]:
full_funding('BorrowerState', 'States', (15, 5))

Califrnia had the highest frequency of fully funded loans

In [106]:
boxplots(20, 10, ['Occupation'], ['PercentFunded'], data_copy,
         ['Relationship between percentFunded and Occupation', 
          'Relationship between percentFunded and Occupation excluding 1'],
          multiple_subplots = True, \
         use_another_data = True, another_data = not_including_1)

For not fully funded loans, Car dealer had a low range of Percent Funded but had high funding, Civil Service, Sales-Commission had high Funding range

In [107]:
full_funding('Occupation', 'Occupation', (15,8))

Although Other and Professional had highest frequency of full funding, Teacher, Executive, COmputer Programming also had relatively high frequency

In [108]:
boxplots(20, 10, ['EmploymentStatus'], ['PercentFunded'], data_copy,
         ['Relationship between percentFunded and Employment Status', 
          'Relationship between percentFunded and Employment Status excluding 1'],
          multiple_subplots = True, \
         use_another_data = True, another_data = not_including_1)

For not fully funded loans, Part-time had fundings of 91%, Full-time and employed had higher range

In [109]:
full_funding('EmploymentStatus', 'EmploymentStatus')

Employed and Full time had relatively high frequency of Fully funded loans

In [110]:
boxplots(20, 10, ['IsBorrowerHomeowner'], ['PercentFunded'], data_copy,
         ['Relationship between percentFunded and IsBorrowerHomeowner', 
          'Relationship between percentFunded and IsBorrowerHomeowner excluding 1'],
          multiple_subplots = True, \
         use_another_data = True, another_data = not_including_1)
In [111]:
full_funding('IsBorrowerHomeowner', 'Homeowner')

Home borrowers not-fully funded loans had a higher range of percent funding and higher frequency of full funding

In [112]:
boxplots(20, 10, ['CurrentlyInGroup'], ['PercentFunded'], data_copy,
         ['Relationship between percentFunded and CurrentlyInGroup', 
          'Relationship between percentFunded and CurrentlyInGroup excluding 1'],
          multiple_subplots = True, \
         use_another_data = True, another_data = not_including_1)

Borrowers that did not get full-funding in group had a lower percent funding than those not in groups

In [113]:
full_funding('CurrentlyInGroup', 'borrowers in groups')

Most borrowers not in groups had full funding, but this could be because of the nature of data being analysed, from the exploration of the variable False had alrger proportion than True

In [114]:
boxplots(20, 10, ['LoanOriginationQuarter'], ['PercentFunded'], data_copy,
         ['Relationship between percentFunded and LoanOriginationQuarter', 
          'Relationship between percentFunded and LoanOriginationQuarter excluding 1'],
          multiple_subplots = True, \
         use_another_data = True, another_data = not_including_1)
In [115]:
full_funding('LoanOriginationQuarter', 'Quarter that had loans')

This is similar to tthe distribution of the Quarter Variable, so I Think they have no effect on if it will be fully funded or not}

In [116]:
boxplots(20, 10, ['IncomeVerifiable'], ['PercentFunded'], data_copy,
         ['Relationship between percentFunded and IncomeVerifiable', 
          'Relationship between percentFunded and IncomeVerifiable excluding 1'],
          multiple_subplots = True, \
         use_another_data = True, another_data = not_including_1)
In [117]:
full_funding('IncomeVerifiable', 'verified income')

This is similar to the distribution of theIncome Verifiable Variable, so I Think they have no effect on if it will be fully funded or not, but Most fully funded borrowers had Income Verified

In [118]:
def heatmap(x_data, y_data, number_of_bins, fig_size,more_subplot = False, x_data2 = None):
    
    '''To plot relationship between numeric data and numeric data in terms of density of their frequency
    
   x_data: str, The data to be on the x-axis, 
   y_data: str, The data to be on the y-axis,
   number_of_bins: int, Number of bins, 
   fig_size: tuple, The size of the figure to plot on,
   more_subplot: bool, If there should be another subplot, 
   x_data2 = str, The data to be on the x-axis, if more_subplot is True
    '''
    
    plt.figure(figsize = fig_size)
    plt.subplot(211) if more_subplot else plt.subplot(111)
    no_of_bins_x = not_including_1[x_data].max()/ number_of_bins
    no_of_bins_y = not_including_1[y_data].max()/ number_of_bins
    bins_x = np.arange(not_including_1[x_data].min(), 
                 not_including_1[x_data].max() +  no_of_bins_x, no_of_bins_x)
    bins_y = np.arange(not_including_1[y_data].min(), 
                 not_including_1[y_data].max() +  no_of_bins_y, no_of_bins_y)
    plt.hist2d(x = x_data, y = y_data, data = not_including_1,
           bins = [bins_x, bins_y], cmap = 'viridis_r', )
    plt.title(f'Relationship between percentFunded and {x_data}')
    plt.xlabel(f'{x_data}', fontdict = {'weight': 'bold'})
    plt.ylabel(f'{y_data}', fontdict = {'weight': 'bold'})
    
    if more_subplot:
        plt.subplot(212)
        no_of_bins_x = not_including_1[x_data2].max()/ number_of_bins
        bins_x = np.arange(not_including_1[x_data2].min(), 
                 not_including_1[x_data2].max() +  no_of_bins_x, no_of_bins_x)
        plt.hist2d(x = x_data2, y = y_data, data = not_including_1,
           bins = [bins_x, bins_y], cmap = 'viridis_r', )
        plt.title(f'Relationship between percentFunded and {x_data2}')
        plt.xlabel(f'{x_data2}', fontdict = {'weight': 'bold'})
        plt.ylabel(f'{y_data}', fontdict = {'weight': 'bold'})
    
    plt.subplots_adjust(hspace = 0.9)
    plt.colorbar();
In [119]:
def contvcont(x_data, y_data, number_of_bins, fig_size,more_subplot = False, x_data2 = None):
    
    '''To plot relationship between numeric data and numeric data
    
   x_data: str, The data to be on the x-axis, 
   y_data: str, The data to be on the y-axis,
   number_of_bins: int, Number of bins, 
   fig_size: tuple, The size of the figure to plot on,
   more_subplot: bool, If there should be another subplot, 
   x_data2: str, The data to be on the x-axis, if more_subplot is True
    '''
    
    plt.figure(figsize = fig_size)
    plt.subplot(211) if more_subplot else plt.subplot(111)
    no_of_bins_x = not_including_1[x_data].max()/ number_of_bins
    no_of_bins_y = not_including_1[y_data].max()/ number_of_bins
    bins_x = np.arange(not_including_1[x_data].min(), 
                 not_including_1[x_data].max() +  no_of_bins_x, no_of_bins_x)
    bins_y = np.arange(not_including_1[y_data].min(), 
                 not_including_1[y_data].max() +  no_of_bins_y, no_of_bins_y)
    sb.regplot(x = x_data, y = y_data, data = not_including_1,)
    plt.title(f'Relationship between percentFunded and {x_data}')
    plt.xlabel(f'{x_data}', fontdict = {'weight': 'bold'})
    plt.ylabel(f'{y_data}', fontdict = {'weight': 'bold'})
    
    if more_subplot:
        plt.subplot(212)
        no_of_bins_x = not_including_1[x_data2].max()/ number_of_bins
        bins_x = np.arange(not_including_1[x_data2].min(), 
                 not_including_1[x_data2].max() +  no_of_bins_x, no_of_bins_x)
        sb.regplot(x = x_data, y = y_data, data = not_including_1)
        plt.title(f'Relationship between percentFunded and {x_data2}')
        plt.xlabel(f'{x_data}', fontdict = {'weight': 'bold'})
        plt.ylabel(f'{y_data}', fontdict = {'weight': 'bold'})
    
    plt.subplots_adjust(hspace = 0.9)
In [120]:
def full_fund_hist(data, number_of_bins, fig_size,more_subplot = False, data2 = None, log_plot = False):
    
    '''To plot frequency of numeric data and numeric data for fully funded loans
    
   data: str, The numeric data to plot,
   number_of_bins: int, Number of bins, 
   fig_size: tuple, The size of the figure to plot on,
   more_subplot: bool, If there should be another subplot, 
   data2: str, The data to be on the x-axis, if more_subplot is 
   log_plot: bool, if the second subplot should be a logarithmic plot
    '''
    
    full_funding = data_copy[data_copy['PercentFunded'] == 1]
    plt.figure(figsize = fig_size) 
    plt.subplot(121) if more_subplot else plt.subplot(111)
    no_of_bins = full_funding[data].max()/ number_of_bins
    bins = np.arange(full_funding[data].min(), 
                 full_funding[data].max() +  no_of_bins, no_of_bins)
    sb.histplot(x = data, data = full_funding, bins = bins, stat = 'density')
    plt.title(f'{data} with full percentFunded')
    plt.xlabel(f'{data}', fontdict = {'weight': 'bold'})
    plt.ylabel(f'frequency of {data} for fully funded loans', fontdict = {'weight': 'bold'})
    
    if more_subplot:
        plt.subplot(122)
        if log_plot:
            no_of_bins = np.log(full_funding[data] + 1).max()/ number_of_bins
            bins = np.arange(np.log(full_funding[data] + 1).min(), 
                 np.log(full_funding[data] + 1).max() +  no_of_bins, no_of_bins)
            plt.xscale('log')
            plt.xlim(0.9e0, 2e1)
            plt.xticks([0.9e0, 2e0, 4e0, 6e0, 8e0, 1e1], ['0', '2', '4', '6', '8', '10'])
        else:
            no_of_bins = full_funding[data2].max()/ number_of_bins
            bins = np.arange(full_funding[data2].min(), 
                 full_funding[data2].max() +  no_of_bins, no_of_bins)
        sb.histplot(x = data if log_plot else data2
                    , data = full_funding, bins = bins, stat = 'density')
        plt.xlabel(f'{data}', fontdict = {'weight': 'bold'})
        plt.ylabel(f'frequency of {data} for fully funded loans', fontdict = {'weight': 'bold'})
        
        if log_plot:
            plt.title(f'logarithm plot of {data} with full percentFunded')
        else:
            plt.title(f'logarithm plot of {data2} with full percentFunded')
    plt.subplots_adjust(wspace = 0.9)
In [121]:
contvcont('CreditScoreRangeLower', 'PercentFunded', 50, (10,10),more_subplot = True, x_data2 = 'CreditScoreRangeUpper')

For fully Funded loans, The credit SCore range have no correlation with Percent Funded

In [122]:
full_fund_hist('CreditScoreRangeLower', 50, (10,10),more_subplot = True, data2 = 'CreditScoreRangeUpper')

Most Fully funded loans had credit score of 700

In [123]:
heatmap('TotalCreditLinespast7years', 'PercentFunded', 50, (15, 6))

Most total credit lines were between 10 and 40, but but most credit lines of 20 and 40 had 70% funding

In [124]:
full_fund_hist('TotalCreditLinespast7years', 50, (10,6))

Credit Lines of close to 18 and close to 28 had highes frequency of full funding

In [125]:
full_fund_hist('DebtToIncomeRatio', 50, (10,6), more_subplot = True, log_plot = True)

Debt-to-Income ratio of 10 had the highest fequency of full fended loans

In [126]:
contvcont('DebtToIncomeRatio', 'PercentFunded', 50, (10,8),\
          more_subplot = True, x_data2 = 'CreditScoreRangeUpper')
In [127]:
def correlation(fig_size, data, multiple_subplot =  False):
    
    '''To plot correlation between numeric data and PercentFunded
    
   fig_size: tuple, The size of the figure to plot on,
   data: str, The numeric data to plot, 
   multiple_subplot: bool, If there should be another subplot for not fully funded loans, 
    '''
    
    plt.figure(figsize = fig_size)
    plt.subplot(121) if multiple_subplot else plt.subplots(111)
    sb.regplot(x = data, y = 'PercentFunded', data = data_copy)
    plt.title(f'Relationship between PercentFunded \nand {data}')
    plt.xlabel(f'{data}', fontdict = {'weight': 'bold'})
    plt.ylabel(f'PercentFunded', fontdict = {'weight': 'bold'})
    if multiple_subplot:
        plt.subplot(122)
        sb.regplot(x = data, y = 'PercentFunded', data = not_including_1)
        plt.title(f'Relationship between PercentFunded \nand {data} of loans'
                  'that were not fully funded')
        plt.subplots_adjust(wspace  = 0.9)
        plt.xlabel(f'{data}', fontdict = {'weight': 'bold'})
        plt.ylabel(f'PercentFunded', fontdict = {'weight': 'bold'})
In [128]:
def fully_funded_for_numeric(fig_size, column):
    
    '''To plot frequency of numeric data for fully funded loans
    
    fig_size: tuple, The size of the figure to plot on,
    column: str, The numeric data to plot, 
    '''
    
    plt.figure(figsize = fig_size)
    plt.subplot(111)
    data_to_use = data_copy[data_copy['PercentFunded'] == 1]
    no_of_bins = data_to_use[column].max()/ 50
    bins = np.arange(data_to_use[column].min(), 
                 data_to_use[column].max() +  no_of_bins, no_of_bins)
    sb.histplot(x = column, data = data_to_use, bins = bins)
    plt.xlabel(f'{column}', fontdict = {'weight': 'bold'})
    plt.ylabel(f'Frequency of {column} for fully funded loans', fontdict = {'weight': 'bold'})
In [129]:
fully_funded_for_numeric((10,4), 'DebtToIncomeRatio')

The DebttoIncomeRatio has a very very low correlation with PercentFunded

In [130]:
correlation((10, 8), 'MonthlyLoanPayment', multiple_subplot = True)

For non-full-funded loans, Monthly payment had a reasonable high positive correlation with percentFunded

In [131]:
correlation((10, 8), 'Recommendations', multiple_subplot = True)

For non-fully funded loans, Recommendations had a little negative correlation with percentFunded

In [132]:
def recommendations(column):
    
    '''To plot distribution of numeric data on recommendations for loans not fully funded
    
   column: str, The numeric data to plot, 
    '''
    
    fig, ax = plt.subplots(1,2)
    reco = not_including_1['Recommendations'].unique()
    for i, ax in zip(reco, ax.ravel()):
        no_recommendation = not_including_1[not_including_1['Recommendations'] == i]
        sb.boxplot(x = 'Recommendations', y = column, data = no_recommendation, ax = ax)
        plt.text(x = 1, y = 0.8, s = 'The only number of recommendation of loans that do not have '
                 'full funding are 0 and 1')
        ax.set_xlabel('Recommendation', fontdict = {'weight': 'bold'})
        ax.set_ylabel(f'PercentFunded', fontdict = {'weight': 'bold'})
    plt.title(f'Relationship between Recomendations and {column} of not fully funded loans',
                 fontdict = {'weight':'bold'})
    plt.subplots_adjust(wspace = 0.9)
    
In [133]:
recommendations('PercentFunded')

For Recommendation of 1 most Funding were between 72% and 75%

In [134]:
recommendations('InvestmentFromFriendsCount')

This infers that recommendations of might have a relationship with friends that invest, because recommmendations of 1 has a range of percent funded, though the distribution is negatively skewed

In [135]:
full_funded = data_copy[data_copy['PercentFunded'] == 1]
ax = plt.subplot(111)
prop = full_funded['Recommendations'].value_counts() / full_funded['Recommendations'].shape[0]
label = [str(x) for x in full_funded['Recommendations'].value_counts().index.to_list()]
sb.barplot(x = prop, y = label, ax  = ax)
for i in range(len(label)):
    prop = full_funded['Recommendations'].value_counts().iloc[i] / \
    full_funded['Recommendations'].shape[0]
    percent = prop * 100
    value = full_funded['Recommendations'].value_counts().iloc[i]
    plt.text(y = i, x = prop, s = '{:0.3f}% ,{}counts'.format(percent, value))
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.title('Proportion of Recommendations for fully funded loans', fontdict = {'Weight': 'bold'},
          pad = 7)
plt.xlabel('proportion', fontdict = {'weight':'bold'})
plt.ylabel('Recommendations', fontdict = {'weight':'bold'});

From distribution of Recommendtaions, most loans had no recommendation, therfore It has no effect on PercentFunding

In [136]:
correlation((10, 8), 'InvestmentFromFriendsAmount', multiple_subplot = True)

InvestmentsfromfriendsAMount has very little correlation with PercentFunded, for not fully funded loans

In [137]:
def relation(x_data, y_data, data):
    
    '''To plot relationship between numeric data
    
   x_data: str, The numeric data to plot on x-axis, 
   y_data: str, The numeric data to plot on y-axis, 
   data: DataFrame, The data to plot from
    '''
    
    sb.regplot(x = x_data, y = y_data, data = data);
    plt.title(f'relationship between {x_data} and {y_data}', fontdict = {'weight':'bold'}, pad = 7)
    plt.xlabel(f'{x_data}', fontdict = {'weight':'bold'})
    plt.ylabel(f'{y_data}', fontdict = {'weight':'bold'});
In [138]:
relation('InvestmentFromFriendsCount', 'Investors', data_copy)

High correlattion (expected)

In [139]:
relation('Recommendations', 'InvestmentFromFriendsCount', data_copy)

High correlation between Recommendations and friends that invest

In [140]:
relation('LenderYield', 'PercentFunded', not_including_1)

Little to no correlation between LendeYield and PercentFunded

In [141]:
def cat_rel(column, hue):
    
    '''To plot counts of two categorical varable
    
   x_data: str, The numeric data to plot on x-axis, 
   y_data: str, The numeric data to plot on y-axis, 
   data: DataFrame, The data to plot from
    '''
    
    sb.countplot(x = column, hue = hue, data = data_copy)
    plt.xticks(rotation = 90)
    plt.title(f'Proportion of {column} and {hue}', fontdict = {'weight':'bold'}, pad = 7)
    plt.xlabel(f'{column}', fontdict = {'weight':'bold'})
    plt.ylabel(f'counts of {column} and {hue}', fontdict = {'weight':'bold'});
In [142]:
cat_rel('EmploymentStatus', 'IncomeVerifiable',)

Most Employed and full-time borrowers had ttheir Income Verifiable, while most self-employed did not have their incomes verifiable

In [143]:
cat_rel('ProsperRating (Alpha)', 'CurrentlyInGroup',)

across all ratings, most borrowers were not in groups

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The correlation between the variable and the other variables was very low almost 0.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

  • High correlation between 'Recommendations'and 'friends that invest'
  • Little to no correlation between LendeYield and PercentFunded

Multivariate Exploration

In [144]:
data_copy.columns 
Out[144]:
Index(['LoanStatus', 'LenderYield', 'ProsperRating (Alpha)', 'ListingCategory',
       'BorrowerState', 'Occupation', 'EmploymentStatus',
       'IsBorrowerHomeowner', 'CurrentlyInGroup', 'CreditScoreRangeLower',
       'CreditScoreRangeUpper', 'TotalCreditLinespast7years',
       'DebtToIncomeRatio', 'IncomeVerifiable', 'LoanOriginationQuarter',
       'MonthlyLoanPayment', 'Recommendations', 'InvestmentFromFriendsCount',
       'InvestmentFromFriendsAmount', 'Investors', 'PercentFunded',
       'investments not from friends'],
      dtype='object')
In [145]:
def vari(col = None, numerical = False, facet = False, minimum = None, pad = None):
    
    '''To plot relationship between LenderYield, PercentFunded and one categorical variable or
     numerical variable
    
    col: str, categorical variable to plot, 
    numerical: str, numerical variable to plot, 
    facet: bool, if it should be facet plot, 
    minimum: int, minimum value for colorbar,
    pad = The pad for title
    '''
    
    plt.figure(figsize = (10,5))
    if numerical:
        plt.scatter(data = not_including_1, x = 'LenderYield', y = 'PercentFunded', c = col, \
           cmap = 'Accent', vmin = minimum)
        plt.colorbar(label = col)
    elif facet:
        g = sb.FacetGrid(data = not_including_1, col = col, col_wrap =3)
        g.map(sb.regplot, 'LenderYield', 'PercentFunded')
        plt.title(f'Relationship between LenderYield, PercentFunded and {col}',
                 fontdict = {'weight': 'bold'}, pad = pad)
    else:
        sb.scatterplot(x = 'LenderYield', y = 'PercentFunded', data = not_including_1,\
                       hue = col, s = 70);
    if not facet:
        plt.title(f'Relationship between LenderYield, PercentFunded and {col}',
                 fontdict = {'weight': 'bold'}, pad  = 10)
        plt.xlabel('LenderYield', fontdict = {'weight':'bold'})
        plt.ylabel('PercentFunded', fontdict = {'weight':'bold'});
    plt.legend(loc = 'right', bbox_to_anchor = (1.5,1))
In [146]:
vari(col = 'EmploymentStatus')

Loans of Most borrowers that are not employed, that have employment status of other have more lender yield, Part-time had high LenderYield and Percent Funded of about 87%, Self-employed has high LenderYield

In [147]:
vari('LoanStatus', facet = True, pad = 650)
No handles with labels found to put in legend.
<Figure size 720x360 with 0 Axes>

even across the LoanStatus feature, the correlation between PercentFunded and lender yiels is neutral or almost neutral.

In [148]:
vari('ProsperRating (Alpha)')

All ratings had the same range of PercentFunded except E that had the least PercentFUnded to be above 70%, From AA to HR there's decrease in LenderYield

In [149]:
vari('CurrentlyInGroup')

Most loans of borrowers in groups had lower Lender Yield, but for not fully funded loans their percentfunded is between 72% and 97%.

In [150]:
vari(col = 'LoanOriginationQuarter', facet= True, pad = 250)
No handles with labels found to put in legend.
<Figure size 720x360 with 0 Axes>

even across the LoanOriginationQuarter feature, the correlation between PercentFunded and lender yiels is neutral or almost neutral.

In [151]:
vari(col = 'IncomeVerifiable')

Most borrowers with Income not verifiabl had high LenderYield

In [152]:
vari(numerical= True, col= 'TotalCreditLinespast7years')

Most loans with totalcredit between 65 and 75 have low Funds percent less than 75%, loans with totalcredit between 35 and 45 have high Funds percent above than 85%

In [153]:
vari(numerical= True, col= 'CreditScoreRangeLower')

Most loans of Credit score between 625 to 725 have higher yields,Most loans of Credit score from730 above had low lender yield

In [154]:
vari(numerical= True, col = 'CreditScoreRangeUpper', minimum = 19)

Most loans of Credit score from 750 and above have lower yields,Most loans of Credit score between 550 and 750 had high lender yield

In [155]:
vari(numerical= True, col = 'MonthlyLoanPayment')

most loans with Monthly Loan payments between 650 and 800 have high PercentFunded and lower LenderYield
most loans with Monthly Loan payments between 300 and 500 have lower
most loans with Monthly Loan payments between 500 to 650 have high PercentFunded and lower LenderYield

In [156]:
def show(xs, y):
    
    '''To plot differnt subplots with different sizes

    xs: list, List of categorical variables to plot,
    y: str, numerical variable to plot agains
    '''
    
    fig = plt.figure(figsize = (10, 5 * len(xs)//2))
    gs = GridSpec(3, 2, figure = fig)
    ax1 = fig.add_subplot(gs[0,0])
    ax2 = fig.add_subplot(gs[0,1])
    ax3 = fig.add_subplot(gs[1,0])
    ax4 = fig.add_subplot(gs[1,1])
    ax5 = fig.add_subplot(gs[2,0:])
    full_funds(fig, xs, y)
    fig.suptitle(f'Relationship between Categorical variables and {y} for fully funded loans',
                fontdict = {'weight': 'bold'})
    plt.show()
In [157]:
def full_funds(fig, xs, y):
    
    '''To plot relationship between numerical variable and categorical variables for fully funded loans
    
    fig: matplotlib.pyplot.figure, The figure to plot on,
    xs: list, List of categorical variables to plot,
    y: str, numerical variable to plot agains
    '''
    
    for i, ax in enumerate(fig.axes):
        if i == 4:
            sb.barplot(x = xs[i], y = y, data = full_funded, color = 'grey', ax = ax, ci = None)
            plt.xticks(rotation = 90)
        else:
            sb.barplot(y = xs[i], x = y, data = full_funded, color = 'grey', ax = ax, ci = None)
    plt.subplots_adjust(wspace = 0.9, hspace = 0.4)
In [158]:
cat = ['ProsperRating (Alpha)', 'LoanStatus', 'ListingCategory','LoanOriginationQuarter', 'BorrowerState']
show(xs = cat, y = 'LenderYield')

For full funding

  • HR rating has the highest lender yield, But AA has the lowest lender yield.
  • Past Due(>120) has the highest lender yield, other past due stauses, Defaulted and chargedoff also have relatively high lender yields above 0.20, while cancelled has the lowest lender yield
  • For the Listing Category, the lender yields are almost the same
  • Loans generated in the second and third quarter have the highest lender yield, but the lender yields across each quarter is almost similar
  • Iowa and Maine has the lowest mean of Lender Yield
In [159]:
show(xs = cat, y = 'MonthlyLoanPayment')

For full funding

  • AA rating has the highest Monthly Loan Payment, But HR has the lowest Monthly Loan Payment.
  • Completed has the highest Monthly Loan Payment, other past due stauses, Defaulted and chargedoff also have relatively high Monthly Loan Payment above 200, while cancelled has the lowest Monthly Loan Payment
  • For the Listing Category, studentuse have the lowest Monthly Loan Payment, and Debtt Consolidation has the highest Monthly Loan Payment
  • Loans generated in the first have the highest Monthly Loan Payment, followed by the fourth quarter
  • Iowa, Nevada and Maine has the lowest mean of Lender Yield
In [ ]:
show(xs = cat, y = 'TotalCreditLinespast7years')

For full funding

  • Across all the catefories their valuse all have the same mean of TotalCreditLines
In [ ]:
show(xs = cat, y = 'CreditScoreRangeUpper')

For full funding

  • Across all the catefories their values all have almos the same mean of CreditScoreRangeUpper
In [ ]:
show(xs = cat, y = 'CreditScoreRangeLower')

For full funding

  • Across all the catefories their values all have almos the same mean of CreditScoreRangLower

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

  • There is a strong relationship between CurrentlyInGroup and lenderYield when plotted with PercentFunded
  • There is a strong relationship between CreditScoreRange and lenderYield when plotted with PercentFunded
  • There is a strong relationship between MonthlyLoanPayment and lenderYield when plotted with PercentFunded

Were there any interesting or surprising interactions between features?

Higher credit score had lower lender yields and lower credit score had higher lender yields

Conclusions

I performed dat wrangling on the data to how tidy and clean the data was. The data was almosttidy as each row was an observation and each column was a variable, there were no duplicates also but there were to columns(Credit grade and prosperRating (Alpha)) that had to be joined.
It was not clean however as the datatypes were not correct and there were so many missing values.I picked the variable of interest and picked variables thet would be necessary in the analysis. I cleaned the data while making sure not to lose so many data. I made exploratory analysis on distribution of all the features I picked, Checked the variation in two variables against each other and even three variables.

Challenges

My variable of interest PerentFunded did not have strong correlation with the other variables. It might be better if more diverse data is collected as most of the observations had PercwntFunding value of 1.0